Bug 990583

Summary: lvremove of thin snapshots takes 5 to 20 minutes (single core cpu bound?)
Product: Red Hat Enterprise Linux 6
Reporter: Lars Ellenberg <lars.ellenberg>
Component: kernel
Assignee: Joe Thornber <thornber>
Status: CLOSED WONTFIX
QA Contact: Storage QE <storage-qe>
Severity: unspecified
Priority: unspecified
Version: 6.4
CC: agk, dwysocha, heinzm, jbrassow, lars.ellenberg, msnitzer, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-09-15 10:57:24 UTC
Type: Bug
Attachments:
  slabtop
  vmstat
  meminfo
  perf-report

Description Lars Ellenberg 2013-07-31 13:38:12 UTC
Description of problem:
lvremove of thin snapshot takes 5 to 20 minutes.

Use case:
one "main" volume, several (here: 28) periodic 1st level snapshots,
usually at most one 2nd level snapshot.

A new snap is created (this usually takes a couple of seconds),
then the oldest snap is removed.

The application is quiescent at that point in time.

The metadata volume is on FusionIO drives in this case,
so it really should not be an IO problem.
Rather, it looks like something in-kernel is single-core CPU bound.

  db                     bu4db  Vwi-aotz   1.10t pool1         87.23
  db.20130727.070002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.11
  db.20130727.110002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.12
  db.20130727.150002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.12
  db.20130727.190002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.12
  db.20130727.230002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.12
  db.20130728.030003.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.13
  db.20130728.070002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.13
  db.20130728.110002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.13
  db.20130728.150002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.13
  db.20130728.190002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.13
  db.20130728.230002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.13
  db.20130729.030007.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.14
  db.20130729.070002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.14
  db.20130729.110002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.14
  db.20130729.150003.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.14
  db.20130729.190002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.14
  db.20130729.230002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.16
  db.20130730.030001.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.16
  db.20130730.070002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.17
  db.20130730.110003.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.19
  db.20130730.150002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.19
  db.20130730.190002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.22
  db.20130730.230002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.22
  db.20130731.030002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.23
  db.20130731.070002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.23
  db.20130731.110002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.23
  pool1                  bu4db  twi-a-tz   1.95t               74.71

metadata_percent: 24.02

Version-Release number of selected component (if applicable):
  LVM version:     2.02.98(2)-RHEL6 (2012-10-15)
  Library version: 1.02.77-RHEL6 (2012-10-15)
  Driver version:  4.22.6

Same behaviour with the RHEL 6.3 (-279*) and 6.4 (-358*) kernels.


How reproducible:
very.
our logfiles consistently show lvremove times of about 300 to 1200 seconds,
with an average of 560 seconds for the last 200 iterations
(more logs available).


Steps to Reproduce:
What we do:
1. create and fill a thin volume
2. keep filling it with random IO (it's a postgres on top of this)
3. keep creating a number of snapshots every few hours
4. remove the oldest snapshot once you reach ~20 or 30 snapshots,
   or once the fill percentage of the metadata or the pool suggests
   removal would be a good idea (rotation sketched below).
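
A minimal sketch of that rotation step, assuming the bu4db/db naming from the
lvs output above (exact lvcreate flags may differ between lvm2 versions; this
is not a verbatim copy of our script):

  # create a new 1st level thin snapshot of the main volume
  lvcreate -s --name "db.$(date +%Y%m%d.%H%M%S).raw" bu4db/db
  # the snapshot names sort by timestamp, so the oldest is simply the first one
  oldest=$(lvs --noheadings -o lv_name bu4db | awk '/^ *db\..*\.raw/ {print $1}' | sort | head -n1)
  # this is the step that takes 5 to 20 minutes
  lvremove -f "bu4db/$oldest"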

Actual results:
lvremove seems to be single core (in-kernel?) cpu-bound.

Expected results:
don't crunch so much, or at least use more than one core?

Additional info:
note: we use 64k chunk size, so the btree is probably "massive".
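
To make "massive" a bit more concrete, a back-of-envelope estimate
(my rough numbers, nothing measured by the kernel):

  # 1.10 TiB virtual size / 64 KiB chunk size, computed in KiB to stay in integers
  echo $(( 11 * 1024 * 1024 * 1024 / 64 / 10 ))
  # => 18454937, i.e. roughly 18.5 million mappings per device,
  #    shared (mostly) across the ~28 snapshot devices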

Behaviour is better with fewer snapshots, or with a lower metadata_percent.

I have several such systems where I could monitor
whatever system parameters you may be interested in.
However, running "arbitrary experiments": not so much,
depending on what exactly you suggest ;-)

To be clear, at this point this is not a real problem;
the system would sit mostly idle for a few hours anyway,
until the next snapshot iteration.

But it is a "major annoyance".

Anything we can do to help nail down the actual bottleneck: let me know.

Thanks,

    Lars

Comment 2 Zdenek Kabelac 2013-08-02 09:28:24 UTC
I guess it would be nice to have the compressed thin-pool kernel metadata and the related lvm2 metadata attached. What size are they when compressed?
(The size of the thin-pool metadata is not visible in lvs output; it needs '-a'.)

Please attach 'lvs -a -o+metadata_percent'

May I assume all the volumes are active when you run lvremove?
(Removal of an inactive thin volume while the thin pool is inactive would trigger 2 metadata consistency checks - once when the pool is activated and once when it is deactivated.)


Also, please try to attach memory usage output: 'slabtop -o' and 'cat /proc/vmstat'.


Also, 'perf record -ag' & 'perf report' of a long-running lvremove command would help.
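
For example, roughly like this to collect all of the above in one go
(the 60 second window and the bu4db VG name are just taken from this report,
adjust as needed):

  lvs -a -o+metadata_percent bu4db  > lvs.txt
  slabtop -o                        > slabtop.txt
  cat /proc/vmstat                  > vmstat.txt
  # attach to the running lvremove and sample for about a minute
  perf record -ag -p "$(pgrep lvremove)" -- sleep 60
  perf report --stdio               > perf-report.txt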

Comment 3 Lars Ellenberg 2013-08-06 10:43:39 UTC
Created attachment 783260 [details]
slabtop

Comment 4 Lars Ellenberg 2013-08-06 10:44:11 UTC
Created attachment 783261 [details]
vmstat

Comment 5 Lars Ellenberg 2013-08-06 10:44:43 UTC
Created attachment 783262 [details]
meminfo

Comment 6 Lars Ellenberg 2013-08-06 10:46:20 UTC
in the original post: metadata_percent: 24.02
in our usage, it seems to vary between 23 and 25 %

Right now:
  pool1                  bu4db  twi-a-tz   1.95t  71.43  23.42
  [pool1_tdata]          bu4db  Twi-aot-   1.95t
  [pool1_tmeta]          bu4db  ewi-aot-  16.00g


all volumes are "active" (mappings exist)
only one volume is "open" (the "main" volume is mounted)
no application IO at that time.

I don't see any thin_check hanging around, only
  10854 root       20   0 23220  4468  3312 R 99.0  0.0  6:46.28 lvremove -f bu4db/db.20130802.030003.raw

meaning 99% CPU (sys) for almost 7 minutes now ...
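
For what it's worth, one way to double-check the user/sys split and that it is
pinned to one core (pidstat is from the sysstat package; just a suggestion,
not output from this box):

  # one sample per second; %usr vs %system and the CPU column show
  # whether lvremove sits in kernel space on a single core
  pidstat -u -p "$(pgrep lvremove)" 1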

slabtop, vmstat, meminfo attached just now.

Regarding the metadata dump,
we are talking about ~1.6 GB of bzip2 (raw dd'ed tmeta volume).
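
For reference, the dump was produced roughly like this (the /dev/mapper name
of the hidden tmeta LV is just what it happens to be on this box, so treat it
as an example):

  # raw copy of the (active) pool metadata device, then compress it;
  # [pool1_tmeta] is visible as a hidden dm device while the pool is active
  dd if=/dev/mapper/bu4db-pool1_tmeta bs=1M | bzip2 -c > pool1_tmeta.raw.bz2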

I'll also try to produce some perf output.
Due to the communication overhead involved [ :-) ]
this may take a few days, though.

Comment 7 Lars Ellenberg 2013-08-06 22:48:31 UTC
Created attachment 783547 [details]
perf-report

grabbed by:
while ! pid=$(pgrep lvremove); do sleep 1; done ; perf record -ag -p $pid

full perf.data available,
or different perf report options, if you like.

Or maybe it is useful enough as it is?

Comment 13 Joe Thornber 2015-10-16 16:06:01 UTC
Patches went in in October 2014 to prefetch metadata, which greatly speeds up this situation.

Comment 14 Lars Ellenberg 2015-10-16 19:34:15 UTC
uhm, it was *CPU* limited. 100% actual CPU usage on a single core.
NOT IO-limited at all.
So how can prefetching of metadata IO "greatly speed up"
something that is CPU bound?
Just curious...

Comment 15 Joe Thornber 2016-01-29 18:30:13 UTC
My guess is the underlying issue is still to do with metadata paging in and out due to the very large thin device.

You're using a very fast device for the metadata, so the IO delay will be minimal.  But every time a metadata page is read or written, a checksum of it is calculated.  Try tweaking the dm-bufio parameters to increase the cache size.
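
For example, on kernels where dm-bufio exposes its cache limit as a writable
module parameter, something along these lines (the parameter name and the
2 GiB value are assumptions on my side; pick a size that fits your RAM):

  # current limit in bytes; 0 usually means "auto-sized from available RAM"
  cat /sys/module/dm_bufio/parameters/max_cache_size_bytes
  # allow up to 2 GiB of cached metadata so more of the 16G tmeta stays resident
  echo $(( 2 * 1024 * 1024 * 1024 )) > /sys/module/dm_bufio/parameters/max_cache_size_bytes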