Bug 990583 - lvremove of thin snapshots takes 5 to 20 minutes (single core cpu bound?)
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Assigned To: Joe Thornber
QA Contact: Storage QE
Keywords: Reopened
Reported: 2013-07-31 09:38 EDT by Lars Ellenberg
Modified: 2017-09-15 06:57 EDT
CC: 10 users

Last Closed: 2017-09-15 06:57:24 EDT
Type: Bug


Attachments
  slabtop (5.11 KB, text/plain) - 2013-08-06 06:43 EDT, Lars Ellenberg
  vmstat (1.99 KB, text/plain) - 2013-08-06 06:44 EDT, Lars Ellenberg
  meminfo (1.18 KB, text/plain) - 2013-08-06 06:44 EDT, Lars Ellenberg
  perf-report (434.25 KB, application/x-gzip) - 2013-08-06 18:48 EDT, Lars Ellenberg

Description Lars Ellenberg 2013-07-31 09:38:12 EDT
Description of problem:
lvremove of thin snapshot takes 5 to 20 minutes.

Use case:
one "main" volume, several (here: 28) periodic 1st-level snapshots,
usually at most one 2nd-level snapshot.

A new snapshot is created (usually takes a couple of seconds),
then the oldest snapshot is removed.

The application is quiescent at that point in time.

The metadata volume is on FusionIO drives in this case,
so it really should not be an IO problem.
Rather, it looks like something in-kernel is single-core CPU bound.

  db                     bu4db  Vwi-aotz   1.10t pool1         87.23
  db.20130727.070002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.11
  db.20130727.110002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.12
  db.20130727.150002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.12
  db.20130727.190002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.12
  db.20130727.230002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.12
  db.20130728.030003.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.13
  db.20130728.070002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.13
  db.20130728.110002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.13
  db.20130728.150002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.13
  db.20130728.190002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.13
  db.20130728.230002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.13
  db.20130729.030007.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.14
  db.20130729.070002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.14
  db.20130729.110002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.14
  db.20130729.150003.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.14
  db.20130729.190002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.14
  db.20130729.230002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.16
  db.20130730.030001.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.16
  db.20130730.070002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.17
  db.20130730.110003.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.19
  db.20130730.150002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.19
  db.20130730.190002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.22
  db.20130730.230002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.22
  db.20130731.030002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.23
  db.20130731.070002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.23
  db.20130731.110002.raw bu4db  Vwi-a-tz   1.10t pool1 db      87.23
  pool1                  bu4db  twi-a-tz   1.95t               74.71

metadata_percent: 24.02

Version-Release number of selected component (if applicable):
  LVM version:     2.02.98(2)-RHEL6 (2012-10-15)
  Library version: 1.02.77-RHEL6 (2012-10-15)
  Driver version:  4.22.6

Same behaviour with the RHEL 6.3 (-279*) and 6.4 (-358*) kernels.


How reproducible:
Very.
Our logfiles consistently show lvremove times of about 300 to 1200 seconds,
with an average of 560 seconds over the last 200 iterations
(more logs available).
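
A minimal sketch of how such per-removal timings could be captured; the
timing wrapper, log path and timestamp format are illustrative, not taken
from this report (the snapshot name is one of those listed above):

  start=$(date +%s)
  lvremove -f bu4db/db.20130727.070002.raw
  echo "$(date -Is) lvremove took $(( $(date +%s) - start ))s" >> /var/log/snap-rotate.log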


Steps to Reproduce:
What we do (see the sketch below):
1. create and fill a thin volume
2. keep filling it with random IO (there is a PostgreSQL database on top of this)
3. keep creating a number of snapshots every few hours
4. remove the oldest snapshot once you reach ~20 or 30 snapshots,
   or once the fill percentage of the metadata or the pool suggests removal would be a good idea.
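
A rough shell sketch of that workload, assuming lvm2 with thin provisioning;
the VG name matches this report, but the sizes, filesystem and oldest-snapshot
selection below are illustrative:

  # 1. thin pool (64k chunks, as in this report) plus a thin volume on it
  lvcreate -L 2T -c 64k -T bu4db/pool1
  lvcreate -V 1.1T -T bu4db/pool1 -n db
  mkfs.xfs /dev/bu4db/db && mount /dev/bu4db/db /mnt/db

  # 2. keep a random-write workload (e.g. the database) running against /mnt/db

  # 3. every few hours: take a new thin snapshot of the main volume
  lvcreate -s -n "db.$(date +%Y%m%d.%H%M%S).raw" bu4db/db

  # 4. once ~20-30 snapshots exist (or metadata/pool usage gets high),
  #    remove the oldest one; this is the step that takes 5 to 20 minutes
  oldest=$(lvs --noheadings -o lv_name bu4db | awk '/^ *db\./ {print $1}' | sort | head -n1)
  lvremove -f "bu4db/$oldest"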

Actual results:
lvremove seems to be single core (in-kernel?) cpu-bound.

Expected results:
Don't crunch so much, or at least use more than one core.

Additional info:
Note: we use a 64k chunk size, so the btree is probably "massive".
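
To put a number on that: the pool's chunk size can be read back with lvs, and
thin_metadata_size (from thin-provisioning-tools, if installed) gives a rough
estimate of metadata needs. A sketch; the sizes are assumptions based on the
figures in this report, and field/flag names may differ between versions:

  lvs -a -o lv_name,chunksize,metadata_percent bu4db
  thin_metadata_size --block-size 64k --pool-size 2t --max-thins 30 --unit m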

Behaviour is better with fewer snapshots, or with a lower metadata_percent.

I have several such systems where I could monitor
whatever system parameters you may be interested in.
Running "arbitrary experiments", however: not so much,
depending on what exactly you suggest ;-)

To be clear, at this point this is not a real problem;
the system would sit mostly idle for a few hours anyway,
until the next snapshot iteration.

But it is a "major annoyance".

If there is anything we can do to help nail down the actual bottleneck, let me know.

Thanks,

    Lars
Comment 2 Zdenek Kabelac 2013-08-02 05:28:24 EDT
It would be nice to have the compressed thin-pool kernel metadata and the related lvm2 metadata attached. What size are they when compressed?
(The size of the thin-pool metadata is not visible in the default lvs output; it needs '-a'.)

Please attach 'lvs -a -o+metadata_percent'

May I assume all the volumes are active when you run lvremove?
(Removal of an inactive thin volume while the thin pool is inactive would trigger two metadata consistency checks: when the pool is activated and again when it is deactivated.)


Also please attach memory usage information: 'slabtop -o' and 'cat /proc/vmstat'.


Also, 'perf record -ag' & 'perf report' of a long-running lvremove command would help.
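
Taken together, the requested data could be collected along these lines (a
sketch; the VG name is the one from this report, and perf being installed on
the host is assumed):

  lvs -a -o+metadata_percent bu4db > lvs.txt
  slabtop -o                       > slabtop.txt
  cat /proc/vmstat                 > vmstat.txt
  cat /proc/meminfo                > meminfo.txt
  # while a slow lvremove is running; stop with Ctrl-C, then generate the report
  perf record -ag -p "$(pgrep lvremove)"
  perf report > perf-report.txt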
Comment 3 Lars Ellenberg 2013-08-06 06:43:39 EDT
Created attachment 783260 [details]
slabtop
Comment 4 Lars Ellenberg 2013-08-06 06:44:11 EDT
Created attachment 783261 [details]
vmstat
Comment 5 Lars Ellenberg 2013-08-06 06:44:43 EDT
Created attachment 783262 [details]
meminfo
Comment 6 Lars Ellenberg 2013-08-06 06:46:20 EDT
In the original post: metadata_percent: 24.02.
In our usage, it seems to vary between 23 and 25%.

Right now:
  pool1                  bu4db  twi-a-tz   1.95t  71.43  23.42
  [pool1_tdata]          bu4db  Twi-aot-   1.95t
  [pool1_tmeta]          bu4db  ewi-aot-  16.00g


All volumes are "active" (mappings exist);
only one volume is "open" (the "main" volume is mounted).
There is no application IO at that time.

I don't see any thin_check hanging around, only
  10854 root       20   0 23220  4468  3312 R 99.0  0.0  6:46.28 lvremove -f bu4db/db.20130802.030003.raw

meaning 99% CPU (sys) for almost 7 minutes now ...

slabtop, vmstat, meminfo attached just now.

Regarding the metadata dump,
we are talking about ~1.6 GB as bzip2 (a raw dd of the tmeta volume).
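
For reference, a dump like that could be produced along these lines; the
device name follows lvm's usual VG-LV_tmeta naming for an active pool, so
verify the path on your system, and note that copying while the pool metadata
is changing may yield an inconsistent image:

  dd if=/dev/mapper/bu4db-pool1_tmeta bs=1M | bzip2 > pool1_tmeta.bz2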

I'll also try to produce some perf output.
Due to the communication overhead involved [ :-) ]
this may take a few days, though.
Comment 7 Lars Ellenberg 2013-08-06 18:48:31 EDT
Created attachment 783547 [details]
perf-report

grabbed by:
while ! pid=$(pgrep lvremove); do sleep 1; done ; perf record -ag -p $pid

The full perf.data is available,
or different perf report options, if you like.

Or maybe it is useful enough as it is?
Comment 13 Joe Thornber 2015-10-16 12:06:01 EDT
Patches went in in October 2014 to prefetch metadata, which greatly speeds up this situation.
Comment 14 Lars Ellenberg 2015-10-16 15:34:15 EDT
Uhm, it was *CPU*-limited: 100% actual CPU usage on a single core.
NOT IO-limited at all.
So how can prefetching of metadata IO "greatly speed up"
something that is CPU bound?
Just curious...
Comment 15 Joe Thornber 2016-01-29 13:30:13 EST
My guess is the underlying issue is still to do with metadata paging in and out due to the very large thin device.

You're using a very fast device for the metadata, so the IO delay will be minimal. But every time a metadata page is read or written, a checksum of it is calculated. Try tweaking the bufio parameters to increase the cache size.
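
A minimal sketch of such tweaking, assuming the running kernel's dm_bufio
module exposes its cache limit as a writable parameter under
/sys/module/dm_bufio/parameters/ (check that the path exists on your kernel;
the 2 GiB value is only an example):

  # current limit in bytes; 0 usually means an automatic size derived from RAM
  cat /sys/module/dm_bufio/parameters/max_cache_size_bytes
  # raise it so more thin-pool metadata stays cached and fewer pages need to
  # be re-read and re-checksummed
  echo $((2 * 1024 * 1024 * 1024)) > /sys/module/dm_bufio/parameters/max_cache_size_bytes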
