Red Hat Bugzilla – Bug 990583
lvremove of thin snapshots takes 5 to 20 minutes (single core cpu bound?)
Last modified: 2017-09-15 06:57:24 EDT
Description of problem:
lvremove of a thin snapshot takes 5 to 20 minutes.
The setup: one "main" volume, several (here: 28) periodic 1st-level snapshots,
and usually at most one 2nd-level snapshot.
A new snap is created (usually takes a couple of seconds),
then the oldest snap is removed.
The application is quiescent at that point in time.
The meta volume is on Fusion-io drives in this case,
so it really should not be an IO problem.
Rather, it looks like something in-kernel is single-core CPU-bound.
lvs output (columns: LV, VG, Attr, LSize, Pool, Origin, Data%):
db db.20130727 bu4db Vwi-aotz 1.10t pool1 87.23
db.20130727.070002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.11
db.20130727.110002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.12
db.20130727.150002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.12
db.20130727.190002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.12
db.20130727.230002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.12
db.20130728.030003.raw bu4db Vwi-a-tz 1.10t pool1 db 87.13
db.20130728.070002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.13
db.20130728.110002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.13
db.20130728.150002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.13
db.20130728.190002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.13
db.20130728.230002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.13
db.20130729.030007.raw bu4db Vwi-a-tz 1.10t pool1 db 87.14
db.20130729.070002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.14
db.20130729.110002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.14
db.20130729.150003.raw bu4db Vwi-a-tz 1.10t pool1 db 87.14
db.20130729.190002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.14
db.20130729.230002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.16
db.20130730.030001.raw bu4db Vwi-a-tz 1.10t pool1 db 87.16
db.20130730.070002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.17
db.20130730.110003.raw bu4db Vwi-a-tz 1.10t pool1 db 87.19
db.20130730.150002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.19
db.20130730.190002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.22
db.20130730.230002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.22
db.20130731.030002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.23
db.20130731.070002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.23
db.20130731.110002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.23
pool1 bu4db twi-a-tz 1.95t 74.71
Version-Release number of selected component (if applicable):
LVM version: 2.02.98(2)-RHEL6 (2012-10-15)
Library version: 1.02.77-RHEL6 (2012-10-15)
Driver version: 4.22.6
Same behaviour with RHEL 6.3 (-279*) and 6.4 (-358*) kernels.
Our logfiles consistently show lvremove times of about 300 to 1200 seconds,
with an average of 560 seconds over the last 200 iterations
(more logs available).
Steps to Reproduce:
What we do:
1. create and fill a thin volume
2. keep filling it with random IO (it's a postgres on top of this)
3. keep creating a number of snapshots every few hours
4. remove the oldest snapshot once you reach ~20 or 30 snapshots,
or once the fill percentage of the metadata or pool suggests removal would be a good idea.
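A minimal shell sketch of the rotation in steps 3-4, assuming the timestamped naming scheme visible in the lvs listing above; the exact lvcreate/lvremove invocations are illustrative, not a transcript of the reporter's scripts:

```shell
#!/bin/sh
# Sketch of the snapshot rotation described in steps 3-4.
# VG/LV names follow this report; everything else is illustrative.
VG=bu4db
ORIGIN=db

# Step 3: create a new thin snapshot, named by timestamp.
lvcreate -s -n "${ORIGIN}.$(date +%Y%m%d.%H%M%S).raw" "${VG}/${ORIGIN}"

# Step 4: remove the oldest snapshot. The timestamp naming makes a
# plain lexical sort chronological, so 'head -n 1' picks the oldest.
oldest=$(lvs --noheadings -o lv_name "$VG" \
         | sed 's/^ *//' | grep "^${ORIGIN}\..*\.raw$" | sort | head -n 1)
[ -n "$oldest" ] && lvremove -f "${VG}/${oldest}"   # <-- the slow step
```

The lvremove on the last line is the 5-to-20-minute operation this report is about.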
lvremove seems to be single-core (in-kernel?) CPU-bound.
Expectation: don't crunch so much, or at least use more than one core?
Note: we use a 64k chunk size, so the btree is probably "massive".
Behaviour is better with fewer snapshots, or at a lower metadata_percent.
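For scale, a back-of-envelope count of the btree mappings implied by the sizes above (1.10 TiB volumes, 64 KiB chunks):

```shell
# Mappings per thin device = volume size / chunk size.
# 1.10 TiB / 64 KiB = 1.10 * 2^40 / 2^16 = 1.10 * 2^24.
chunks=$(( 110 * 16777216 / 100 ))   # 16777216 = 2^24
echo "$chunks"   # 18454937, i.e. ~18.5 million mappings per device
```

With ~29 thin devices in the listing above, that is on the order of half a billion mappings in the pool's metadata btrees.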
I have several such systems where I could monitor
whatever system parameters you may be interested in.
Running "arbitrary experiments", however: not so much,
depending on what exactly you suggest ;-)
To be clear, at this point this is not a real problem;
the system would sit mostly idle for a few hours anyway,
until the next snapshot iteration.
But it is a "major annoyance".
Anything we can do to help nail down the actual bottleneck: let me know.
It would be nice to have the compressed thin-pool kernel metadata and the related lvm2 metadata attached. What size are they when compressed?
(size of thin-pool metadata is not visible in lvs output (needs '-a'))
Please attach 'lvs -a -o+metadata_percent'
May I assume all the volumes are active when you run lvremove?
(Since removal of an inactive thin volume while the thin pool is inactive would trigger 2 metadata consistency checks - once when the pool is activated and once when it is deactivated.)
Also please attach memory usage output: 'slabtop -o' and 'cat /proc/vmstat'.
Also, 'perf record -ag' & 'perf report' of the long-running lvremove command would help.
Created attachment 783260 [details]
Created attachment 783261 [details]
Created attachment 783262 [details]
In the original post: metadata_percent: 24.02.
In our usage, it seems to vary between 23 and 25%.
pool1 bu4db twi-a-tz 1.95t 71.43 23.42
[pool1_tdata] bu4db Twi-aot- 1.95t
[pool1_tmeta] bu4db ewi-aot- 16.00g
all volumes are "active" (mappings exist)
only one volume is "open" (the "main" volume is mounted)
no application IO at that time.
I don't see any thin_check hanging around, only
10854 root 20 0 23220 4468 3312 R 99.0 0.0 6:46.28 lvremove -f bu4db/db.20130802.030003.raw
meaning 99% CPU (sys) for almost 7 minutes ...
slabtop, vmstat, meminfo attached just now.
Regarding the metadata dump,
we are talking about ~1.6 GB bzip2'ed (raw dd'ed tmeta volume).
I'll also try to produce some perf output.
Due to the involved communication overhead [ :-) ]
this may take a few days, though.
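For reference, a hedged sketch of how such a raw tmeta dump could be produced. The device-mapper node name is an assumption inferred from the VG/LV names in this report (vg-lv with a _tmeta suffix) and may differ; note also that copying metadata out of a live, active pool can yield an inconsistent image:

```shell
# Raw-copy the thin-pool metadata device and compress it.
# /dev/mapper/bu4db-pool1_tmeta is assumed from the names above;
# verify the actual node with 'dmsetup ls' first.
dd if=/dev/mapper/bu4db-pool1_tmeta bs=1M \
  | bzip2 -c > pool1_tmeta.img.bz2
```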
Created attachment 783547 [details]
while ! pid=$(pgrep lvremove); do sleep 1; done ; perf record -ag -p $pid
full perf.data available,
or different perf report options, if you like.
Or maybe it is useful enough as it is?
Patches went in in October 2014 to prefetch metadata which greatly speeds up this situation.
uhm, it was *CPU* limited: 100% actual CPU usage on a single core,
NOT IO-limited at all.
So how can prefetching metadata IO "greatly speed up"
something that is CPU-bound?
My guess is that the underlying issue still has to do with metadata paging in and out due to the very large thin device.
You're using a very fast device for the metadata, so the IO delay will be minimal. But every time a metadata page is read or written, a checksum of it is calculated. Try tweaking the bufio parameters to increase the cache size.
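As a sketch of that tweak, assuming a kernel where dm-bufio exposes its cache size as a writable module parameter (as in drivers/md/dm-bufio.c; the default and the set of available tunables vary by kernel version):

```shell
# Inspect and enlarge the dm-bufio cache, which holds the thin-pool
# metadata pages whose checksums get recomputed on every reload.
cat /sys/module/dm_bufio/parameters/max_cache_size_bytes    # current limit
echo $(( 2 * 1024 * 1024 * 1024 )) \
    > /sys/module/dm_bufio/parameters/max_cache_size_bytes  # e.g. 2 GiB
```

If the whole tmeta working set fits in the cache, pages should stop cycling in and out, and with them the per-page checksum work.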