Bug 1668163 - LVM cache cannot flush buffer, change cache type, or lvremove LV (CachePolicy 'cleaner' also doesn't work) [NEEDINFO]
Summary: LVM cache cannot flush buffer,change cache type or lvremove LV (CachePolicy '...
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: unspecified
Target Milestone: rc
: ---
Assignee: Zdenek Kabelac
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-22 05:46 UTC by Strahil Nikolov
Modified: 2019-08-20 17:09 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
zkabelac: needinfo? (hunter86_bg)


Attachments (Terms of Use)
strace output of change of cache to writethrough (437.02 KB, text/plain)
2019-01-22 05:46 UTC, Strahil Nikolov
strace of lvconvert uncache (437.05 KB, text/plain)
2019-01-22 05:47 UTC, Strahil Nikolov
bonnie++ results with different chunk sizes (7.40 KB, application/zip)
2019-01-30 12:52 UTC, Strahil Nikolov


Links
System ID Priority Status Summary Last Updated
CentOS 0015729 None None None 2019-01-22 05:46:40 UTC

Description Strahil Nikolov 2019-01-22 05:46:40 UTC
Created attachment 1522331 [details]
strace output of change of cache to writethrough

Description of problem:
LVM cache in writeback mode cannot be changed to writethrough, nor can it be flushed to the data_corig LV (endless loop).
Setting the CachePolicy to 'cleaner' doesn't force the LVM cache to be flushed.
lvremove also fails, due to the endless buffer flush.

Version-Release number of selected component (if applicable):
libblockdev-lvm-2.18-3.el7.x86_64
lvm2-2.02.180-10.el7_6.2.x86_64
lvm2-libs-2.02.180-10.el7_6.2.x86_64
udisks2-lvm2-2.7.3-8.el7.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Follow official documentation for creation of lvm cache (use '--cachemode writeback')
2. Fill in some data - you will notice that cache_dirty_blocks won't get flushed
3. Set CachePolicy to 'cleaner'
4. Try to change the cache mode from 'writeback' to 'writethrough':
lvchange --cachemode writethrough VG/LV
5. Try to uncache:
lvconvert --uncache VG/LV
6. Try to remove LV:
lvremove VG/LV

Actual results:
cache_dirty_blocks are not flushed to the data_corig LV; an endless loop of 'Flushing xxxx blocks for cache VG/LV.' is reported, but the count neither drops nor ever finishes.
The cleaner policy fails to flush the cache.
lvremove cannot remove the LV.

Expected results:
All dirty blocks should be flushed during cache mode change / uncache operations and during LV removal.

Additional info:
Other OS reports:
https://www.reddit.com/r/archlinux/comments/9owc15/lvm_cache_not_flushing_after_unclean_shutdown/ 
https://marc.info/?l=linux-lvm&m=152948734523317&w=2

Comment 2 Strahil Nikolov 2019-01-22 05:47:24 UTC
Created attachment 1522332 [details]
strace of lvconvert uncache

Comment 3 Zdenek Kabelac 2019-01-22 09:04:31 UTC
Please provide basic info about the cache chunksize   (lvs -o+chunksize)

The assumption would be: the cache chunksize is >= 1MiB, no bigger migration_threshold was specified, and it remained at the default value of 2048 sectors (1MiB).

This prevents the kernel from flushing blocks.

There are several bugs about this.
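The arithmetic behind the stall can be sketched as follows (a minimal sketch; the ~9.41MiB chunk size is taken from the reporter's later output, expressed here as 9728 KiB, and migration_threshold is counted in 512-byte sectors):

```shell
# migration_threshold defaults to 2048 sectors, i.e. 1 MiB
default_threshold_sectors=2048
# one ~9.41 MiB chunk, as 9728 KiB = 19456 sectors
chunk_sectors=$((9728 * 2))
# whole chunks the kernel may migrate per pass under the default threshold
chunks_per_pass=$((default_threshold_sectors / chunk_sectors))
echo "$chunks_per_pass"   # 0 - not even one chunk fits, so nothing ever flushes
```

With zero chunks migratable per pass, the dirty count reported by 'Flushing xxxx blocks' can never drop.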


The quick workaround solution is to set higher threshold:

lvchange --cachesettings migration_threshold=16384 vg/cacheLV



There are already upstream patches where lvm2 now guards this setting and ensures the threshold is at least 8 chunks big:

https://www.redhat.com/archives/lvm-devel/2019-January/msg00032.html
https://www.redhat.com/archives/lvm-devel/2019-January/msg00031.html
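Under the 8-chunk rule from those patches, the minimum safe threshold for a given chunk size can be estimated like this (a sketch; 9728K is one of the candidate chunk sizes discussed in this report, and migration_threshold is in 512-byte sectors):

```shell
chunk_kib=9728                        # candidate chunk size in KiB
chunk_sectors=$((chunk_kib * 2))      # 512-byte sectors per chunk
min_threshold=$((chunk_sectors * 8))  # the patches enforce >= 8 chunks
echo "$min_threshold"                 # 155648 sectors (76 MiB)
```

This is why the reporter's later value of 32768 sectors (16MiB, just over one chunk) unblocks flushing but is still below what the upstream guard would pick.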

Comment 4 Strahil Nikolov 2019-01-22 14:29:14 UTC
Thanks for your fast reply.

The chunk size was quite large:
[root@ovirt2 ~]# lvs -o name,cache_policy,cache_settings,chunk_size,cache_used_blocks,cache_dirty_blocks /dev/gluster_vg_md0/gluster_lv_data 
  LV              CachePolicy CacheSettings             Chunk  CacheUsedBlocks  CacheDirtyBlocks
  gluster_lv_data cleaner     migration_threshold=16384 <9.41m             2332             2332

Once I set the migration_threshold to 32768, the cleaner policy immediately started flushing the buffer:

[root@ovirt2 ~]# lvs -o name,cache_policy,cache_settings,chunk_size,cache_used_blocks,cache_dirty_blocks /dev/gluster_vg_md0/gluster_lv_data 
  LV              CachePolicy CacheSettings             Chunk  CacheUsedBlocks  CacheDirtyBlocks
  gluster_lv_data cleaner     migration_threshold=32768 <9.41m             2332             2304
[root@ovirt2 ~]# lvs -o name,cache_policy,cache_settings,chunk_size,cache_used_blocks,cache_dirty_blocks /dev/gluster_vg_md0/gluster_lv_data 
  LV              CachePolicy CacheSettings             Chunk  CacheUsedBlocks  CacheDirtyBlocks
  gluster_lv_data cleaner     migration_threshold=32768 <9.41m             2332             2271

It's nice to know that this is a known problem and that it's being worked on.

Maybe it's a good candidate for a Red Hat Solution?

Thanks for the hint with migration_threshold. I didn't know it affects caches.

Best Regards,
Strahil Nikolov

Comment 5 Zdenek Kabelac 2019-01-29 10:26:47 UTC
Hi

If it wouldn't be a big problem:

For now, upstream tries to select the default chunksize as a power of 2; for performance reasons it's likely better to stick with a power-of-2 size even if the chunk itself ends up bigger, i.e. 8M chunks -> 16M chunks.

https://www.redhat.com/archives/lvm-devel/2019-January/msg00053.html

It would probably be nice to have some testing trials on whether this makes sense, or whether we should rather focus on 1MiB or 512KiB boundaries.

Would it be a big problem to perform some tests to see what gives the best results?

i.e. is a chunksize of 9.5MiB better, or 10MiB, or 16MiB?

You could probably easily try this yourself without applying any patch: just set the chunksize accordingly when creating the cache:

-c 9728K
-c 10240K
-c 16384K

Also, what are the queue settings of the underlying caching device in use, i.e.:

grep "" /sys/block/sda/queue/*

Comment 6 Strahil Nikolov 2019-01-30 12:52:36 UTC
Created attachment 1525028 [details]
bonnie++ results with different chunk sizes

Comment 7 Strahil Nikolov 2019-01-30 12:53:42 UTC
I have done some testing with bonnie++ using a SATA III SSD.
The device in use:
[root@ovirt1 ~]# grep "" /sys/block/sda/queue/*
/sys/block/sda/queue/add_random:0
/sys/block/sda/queue/discard_granularity:512
/sys/block/sda/queue/discard_max_bytes:2147450880
/sys/block/sda/queue/discard_zeroes_data:1
/sys/block/sda/queue/hw_sector_size:512
grep: /sys/block/sda/queue/iosched: Is a directory
/sys/block/sda/queue/iostats:1
/sys/block/sda/queue/logical_block_size:512
/sys/block/sda/queue/max_hw_sectors_kb:32767
/sys/block/sda/queue/max_integrity_segments:0
/sys/block/sda/queue/max_sectors_kb:512
/sys/block/sda/queue/max_segments:168
/sys/block/sda/queue/max_segment_size:65536
/sys/block/sda/queue/minimum_io_size:4096
/sys/block/sda/queue/nomerges:0
/sys/block/sda/queue/nr_requests:128
/sys/block/sda/queue/optimal_io_size:0
/sys/block/sda/queue/physical_block_size:4096
/sys/block/sda/queue/read_ahead_kb:128
/sys/block/sda/queue/rotational:0
/sys/block/sda/queue/rq_affinity:1
/sys/block/sda/queue/scheduler:[noop] deadline cfq
/sys/block/sda/queue/unpriv_sgio:0
/sys/block/sda/queue/write_same_max_bytes:0



Note: The SSD was removed from the VG and blkdiscard-ed before each test. Sadly I didn't have enough time, so only a single test run was performed per chunk size.
It seems that the maximum number of chunks for a cache pool is 1,000,000, so with larger devices a larger chunk size will be needed.
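The 1,000,000-chunk ceiling translates directly into a maximum cache size per chunk size; a quick sketch of that arithmetic:

```shell
max_chunks=1000000   # the cache_pool_max_chunks default mentioned below
chunk_kib=512        # 512 KiB chunks
max_cache_gib=$((max_chunks * chunk_kib / 1024 / 1024))
echo "$max_cache_gib"   # 488 - largest cache pool (in GiB) at this chunk size
```

So at 512K chunks the pool tops out just under 512G, which is why bigger devices force bigger chunks unless the limit itself is raised.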

Comment 8 Zdenek Kabelac 2019-01-30 13:59:42 UTC
Hmm, interesting. I'm unsure how consistent bonnie++ is in those results, but it is quite interesting that 2MiB somehow 'breaks' the line.
It also looks like cache chunks above 512K are not really all that useful, as they possibly do not fully use the capabilities of the caching drive.

The limit of 1,000,000 chunks is ATM pretty strongly enforced. It originates from the age before the V2 cache format, when startup and shutdown could take a significant amount of time. The other reason is that a cache maintaining more chunks will eat quite a portion of RAM, though typically, if people can afford a TiB for caching, RAM is almost never an issue either.

So ATM users CAN raise this limit on their own in lvm.conf:  allocation/cache_pool_max_chunks
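A hypothetical lvm.conf override raising that limit might look like this (the value 2000000 is only an illustration, not a recommendation):

```
# /etc/lvm/lvm.conf
allocation {
    # default cap is 1000000; raising it allows more (smaller) chunks per
    # cache pool, at the cost of more kernel metadata held in RAM
    cache_pool_max_chunks = 2000000
}
```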

It needs some examination of which boundaries are more costly than others.
It looks like 16M chunks aren't productive for some workloads.

I don't have good answers to these questions.

ATM, in your case I'd probably recommend trying just a 512G cache with 512K chunks, and seeing whether it actually works better/faster that way than with a bigger cache but bigger chunks.

It could also be that (if you have a lot of RAM) raising the limit to 2 or maybe even 4 million chunks per cache still results in acceptable memory occupancy, and lets you use a bigger cache pool with a 512K chunk size.

It would be nice if you shared some results on usability and performance after a couple of days of use

(since the hotspot cache takes a while to 'optimize' for your device's hot spots)

Comment 9 Strahil Nikolov 2019-01-30 14:20:32 UTC
I forgot to mention what flags were used: 
bonnie++ -d /mnt -s 50G -n 0 -m TEST-chunk_size_k_chunk-writeback -f -b  -u root

In my case the systems will be a hyperconverged oVirt with GlusterFS lab, and the available cache will be around 150G.
As Gluster will be limited by the network (sadly the workstations have only 1 Gbit/s interfaces), I cannot expect writes above ~123 MiB/s, but I have noticed that the reads are quite good.

Do you think that a chunk size of 4M can lead to better reads (generally speaking), keeping in mind that most SSDs report 4096 bytes as optimal?

This one is from my SSD:
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Comment 10 Zdenek Kabelac 2019-01-31 15:46:34 UTC
The chunk size should likely match the best IO size for the SSD and also not be disruptive for your 'origin' LV, i.e. if it is a RAID array the chunk should cover a whole stripe so that no additional read-modify-write operations are involved. But this is tricky to get right, since many drives report rather random values for minimal and optimal IO sizes.
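As an illustration of the stripe-alignment point (hypothetical RAID layout, not taken from this report):

```shell
stripe_unit_kib=128   # per-disk stripe unit of the hypothetical array
data_disks=4          # data-bearing disks (e.g. a 6-disk RAID6)
# a cache chunk should be a multiple of the full stripe to avoid
# read-modify-write on writeback
full_stripe_kib=$((stripe_unit_kib * data_disks))
echo "$full_stripe_kib"   # 512
```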

So it would be good to see a comparative table: if you e.g. raise the max-chunks bar, how much extra memory does it cost, and does the possibly bigger performance pay off? But this is surely a longer-term evaluation (as the hot-spot cache takes its time to tune well).

IMHO it's probably better to go with 'smaller' chunks, as small as 512K or 1M.
From your charts, many workloads seemed likely degraded already at 4M size.

Also, maybe don't use the full size for the cache, so it's actually not that big but still has those 1 million 512K chunks; that can give you the same performance, and you can use the remaining space for something else.

