Red Hat Bugzilla – Bug 990583
lvremove of thin snapshots takes 5 to 20 minutes (single core cpu bound?)
Last modified: 2017-09-15 06:57:24 EDT
Description of problem:
lvremove of a thin snapshot takes 5 to 20 minutes.
The setup: one "main" volume, several (here: 28) periodic 1st-level snapshots,
and usually at most one 2nd-level snapshot.
A new snap is created (usually takes a couple of seconds),
then the oldest snap is removed.
The application is quiescent at that point in time.
The meta volume is on Fusion-io drives in this case,
so it really should not be an IO problem.
Rather, it looks like something in-kernel is single-core CPU-bound.
lvs output (columns: LV, VG, Attr, LSize, Pool, Origin, Data%):
db db.20130727 bu4db Vwi-aotz 1.10t pool1 87.23
db.20130727.070002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.11
db.20130727.110002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.12
db.20130727.150002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.12
db.20130727.190002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.12
db.20130727.230002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.12
db.20130728.030003.raw bu4db Vwi-a-tz 1.10t pool1 db 87.13
db.20130728.070002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.13
db.20130728.110002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.13
db.20130728.150002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.13
db.20130728.190002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.13
db.20130728.230002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.13
db.20130729.030007.raw bu4db Vwi-a-tz 1.10t pool1 db 87.14
db.20130729.070002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.14
db.20130729.110002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.14
db.20130729.150003.raw bu4db Vwi-a-tz 1.10t pool1 db 87.14
db.20130729.190002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.14
db.20130729.230002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.16
db.20130730.030001.raw bu4db Vwi-a-tz 1.10t pool1 db 87.16
db.20130730.070002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.17
db.20130730.110003.raw bu4db Vwi-a-tz 1.10t pool1 db 87.19
db.20130730.150002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.19
db.20130730.190002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.22
db.20130730.230002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.22
db.20130731.030002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.23
db.20130731.070002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.23
db.20130731.110002.raw bu4db Vwi-a-tz 1.10t pool1 db 87.23
pool1 bu4db twi-a-tz 1.95t 74.71
Version-Release number of selected component (if applicable):
LVM version: 2.02.98(2)-RHEL6 (2012-10-15)
Library version: 1.02.77-RHEL6 (2012-10-15)
Driver version: 4.22.6
Same behaviour with RHEL 6.3 (-279*) and 6.4 (-358*) kernels.
Our logfiles consistently show lvremove times of about 300 to 1200 seconds,
with an average of 560 seconds over the last 200 iterations
(more logs available).
Steps to Reproduce:
What we do:
1. create and fill a thin volume
2. keep filling it with random IO (it's a postgres on top of this)
3. keep creating a number of snapshots every few hours
4. remove the oldest snapshot once you reach ~20 or 30 snapshots,
or once the fill percentage of the metadata or pool suggests removal would be a good idea.
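A minimal shell sketch of the rotation in steps 3-4, assuming the timestamped naming scheme visible in the lvs listing above; the exact lvcreate/lvremove invocations are illustrative, not a transcript of the reporter's scripts:

```shell
#!/bin/sh
# Sketch of the snapshot rotation described in steps 3-4.
# VG/LV names follow this report; everything else is illustrative.
VG=bu4db
ORIGIN=db

# Step 3: create a new thin snapshot, named by timestamp.
lvcreate -s -n "${ORIGIN}.$(date +%Y%m%d.%H%M%S).raw" "${VG}/${ORIGIN}"

# Step 4: remove the oldest snapshot. The timestamp naming makes a
# plain lexical sort chronological, so 'head -n 1' picks the oldest.
oldest=$(lvs --noheadings -o lv_name "$VG" \
         | sed 's/^ *//' | grep "^${ORIGIN}\..*\.raw$" | sort | head -n 1)
[ -n "$oldest" ] && lvremove -f "${VG}/${oldest}"   # <-- the slow step
```

The lvremove on the last line is the 5-to-20-minute operation this report is about.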
lvremove seems to be single-core (in-kernel?) CPU-bound.
Expectation: don't crunch so much, or at least use more than one core?
Note: we use a 64k chunk size, so the btree is probably "massive".
Behaviour is better with fewer snapshots, or at a lower metadata_percent.
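For scale, a back-of-envelope count of the btree mappings implied by the sizes above (1.10 TiB volumes, 64 KiB chunks):

```shell
# Mappings per thin device = volume size / chunk size.
# 1.10 TiB / 64 KiB = 1.10 * 2^40 / 2^16 = 1.10 * 2^24.
chunks=$(( 110 * 16777216 / 100 ))   # 16777216 = 2^24
echo "$chunks"   # 18454937, i.e. ~18.5 million mappings per device
```

With ~29 thin devices in the listing above, that is on the order of half a billion mappings in the pool's metadata btrees.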
I have several such systems where I could monitor
whatever system parameters you may be interested in.
Running "arbitrary experiments", however: not so much,
depending on what exactly you suggest ;-)
To be clear, at this point this is not a real problem;
the system would sit mostly idle for a few hours anyway,
until the next snapshot iteration.
But it is a "major annoyance".
Anything we can do to help nail down the actual bottleneck: let me know.
It would be nice to have the compressed thin-pool kernel metadata and the related lvm2 metadata attached. What size are they when compressed?
(size of thin-pool metadata is not visible in lvs output (needs '-a'))
Please attach 'lvs -a -o+metadata_percent'
May I assume all the volumes are active when you run lvremove?
(Since removal of an inactive thin volume while the thin pool is inactive would trigger 2 metadata consistency checks - once when the pool is activated and once when it is deactivated.)
Also please attach memory usage output: 'slabtop -o' and 'cat /proc/vmstat'.
Also, 'perf record -ag' & 'perf report' of the long-running lvremove command would help.
Created attachment 783260 [details]
Created attachment 783261 [details]
Created attachment 783262 [details]
In the original post: metadata_percent: 24.02.
In our usage, it seems to vary between 23 and 25%.
pool1 bu4db twi-a-tz 1.95t 71.43 23.42
[pool1_tdata] bu4db Twi-aot- 1.95t
[pool1_tmeta] bu4db ewi-aot- 16.00g
all volumes are "active" (mappings exist)
only one volume is "open" (the "main" volume is mounted)
no application IO at that time.
I don't see any thin_check hanging around, only
10854 root 20 0 23220 4468 3312 R 99.0 0.0 6:46.28 lvremove -f bu4db/db.20130802.030003.raw
meaning 99% CPU (sys) for almost 7 minutes ...
slabtop, vmstat, meminfo attached just now.
Regarding the metadata dump,
we are talking about ~1.6 GB bzip2'ed (raw dd'ed tmeta volume).
I'll also try to produce some perf output.
Due to the involved communication overhead [ :-) ]
this may take a few days, though.
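For reference, a hedged sketch of how such a raw tmeta dump could be produced. The device-mapper node name is an assumption inferred from the VG/LV names in this report (vg-lv with a _tmeta suffix) and may differ; note also that copying metadata out of a live, active pool can yield an inconsistent image:

```shell
# Raw-copy the thin-pool metadata device and compress it.
# /dev/mapper/bu4db-pool1_tmeta is assumed from the names above;
# verify the actual node with 'dmsetup ls' first.
dd if=/dev/mapper/bu4db-pool1_tmeta bs=1M \
  | bzip2 -c > pool1_tmeta.img.bz2
```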
Created attachment 783547 [details]
while ! pid=$(pgrep lvremove); do sleep 1; done ; perf record -ag -p $pid
full perf.data available,
or different perf report options, if you like.
Or maybe it is useful enough as it is?
Patches went in in October 2014 to prefetch metadata which greatly speeds up this situation.
uhm, it was *CPU* limited: 100% actual CPU usage on a single core,
NOT IO-limited at all.
So how can prefetching metadata IO "greatly speed up"
something that is CPU-bound?
My guess is that the underlying issue still has to do with metadata paging in and out due to the very large thin device.
You're using a very fast device for the metadata, so the IO delay will be minimal. But every time a metadata page is read or written, a checksum of it is calculated. Try tweaking the bufio parameters to increase the cache size.
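As a sketch of that tweak, assuming a kernel where dm-bufio exposes its cache size as a writable module parameter (as in drivers/md/dm-bufio.c; the default and the set of available tunables vary by kernel version):

```shell
# Inspect and enlarge the dm-bufio cache, which holds the thin-pool
# metadata pages whose checksums get recomputed on every reload.
cat /sys/module/dm_bufio/parameters/max_cache_size_bytes    # current limit
echo $(( 2 * 1024 * 1024 * 1024 )) \
    > /sys/module/dm_bufio/parameters/max_cache_size_bytes  # e.g. 2 GiB
```

If the whole tmeta working set fits in the cache, pages should stop cycling in and out, and with them the per-page checksum work.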