Bug 1763895 - thin_restore fails with transaction_manager::new_block() couldn't allocate new block
Summary: thin_restore fails with transaction_manager::new_block() couldn't allocate ne...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: device-mapper-persistent-data
Version: 7.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: rc
: ---
Assignee: Joe Thornber
QA Contact: Lin Li
URL:
Whiteboard:
Depends On:
Blocks: 1899131
Reported: 2019-10-21 20:37 UTC by bugzilla
Modified: 2021-09-03 12:11 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1899131 (view as bug list)
Environment:
Last Closed: 2020-11-18 15:18:34 UTC
Target Upstream Version:


Attachments
0.8.5-1 repair tools to test with (421.39 KB, application/x-rpm)
2019-10-23 03:08 UTC, nikhil kshirsagar

Description bugzilla 2019-10-21 20:37:35 UTC
Description of problem:

We are trying to thin_restore from a thin_dump xml and get the following error:

~]# thin_restore -i /mnt/tmp/tmeta.xml -o /dev/data/tmeta-dest
truncating metadata device to 4161600 4k blocks
Restoring: [====>                                             ] | 11%
transaction_manager::new_block() couldn't allocate new block
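
For scale, the "truncating" message above can be turned into sizes with plain arithmetic. This is only a sketch: it assumes the `-L 16g` from the reproduction steps means 16 GiB and takes the 4k block size from the message itself. The variable names are illustrative.

```python
BLOCK_SIZE = 4096                       # "4k blocks" from the tool output
lv_bytes = 16 * 1024**3                 # assumed size of the 16g destination LV

lv_blocks = lv_bytes // BLOCK_SIZE      # 4k blocks that fit in the LV
usable_blocks = 4161600                 # figure printed by thin_restore
unused_blocks = lv_blocks - usable_blocks
usable_gib = usable_blocks * BLOCK_SIZE / 1024**3

print(lv_blocks, unused_blocks, round(usable_gib, 3))
```

So the destination is already at the size the tool is willing to use, and a larger LV would not give the restore more room.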

Version-Release number of selected component (if applicable):

0.8.5

How reproducible:

Very

Steps to Reproduce:
1. lvcreate -L 16g /dev/data/tmeta-dest
2. wget https://www.duetsolution.com/static/tmeta.xml.bz2
3. bunzip2 tmeta.xml.bz2
4. thin_restore -i tmeta.xml -o /dev/data/tmeta-dest


Actual results:

transaction_manager::new_block() couldn't allocate new block

Expected results:

Should restore the tmeta volume

Additional info:

A production server has been offline for several days trying to repair its metadata.  Please test and propose a fix or patch ASAP.  Thank you for your help!

Comment 2 Mike Snitzer 2019-10-22 18:06:09 UTC
(In reply to bugzilla from comment #0)
> Description of problem:
> 
> We are trying to thin_restore from a thin_dump xml and get the following
> error:
> 
> ~]# thin_restore -i /mnt/tmp/tmeta.xml -o /dev/data/tmeta-dest
> truncating metadata device to 4161600 4k blocks
> Restoring: [====>                                             ] | 11%
> transaction_manager::new_block() couldn't allocate new block
> 
> Version-Release number of selected component (if applicable):
> 
> 0.8.5
>
> How reproducible:
> 
> Very
> 
> Steps to Reproduce:
> 1. lvcreate -L 16g /dev/data/tmeta-dest
> 2. wget https://www.duetsolution.com/static/tmeta.xml.bz2
> 3. bunzip2 tmeta.xml.bz2
> 4. thin_restore -i tmeta.xml -o /dev/data/tmeta-dest

Don't need to download 8.3G to verify the existing code will fail like you've shown.  Knowing _why_ would be nice but before getting to that:
Restore is likely failing due to bad metadata.  How did the thin_repair phase go?

What steps did you take?

Comment 3 bugzilla 2019-10-23 02:02:05 UTC
We tried thin_repair first, which failed as follows:

~]# thin_repair -i /dev/mapper/data-data--pool_tmeta -o /dev/data/tmeta-dest
truncating metadata device to 4161600 4k blocks
terminate called after throwing an instance of 'std::runtime_error'
  what():  transaction_manager::new_block() couldn't allocate new block

Aborted (core dumped)

~]# /usr/sbin/thin_repair -V
0.8.5-1.el7

Since it failed, we tried a thin_dump --repair and then a thin_restore.  But it failed as shown above.

Comment 4 nikhil kshirsagar 2019-10-23 02:40:08 UTC
Hi Eric,

Did you miss the email that I sent out after I saw your initial mail? We ran into this issue a few months ago on 0.8.5 and it was subsequently fixed in the later versions of the repair tools.

re-pasting my email.

-nikhil.

---------- Forwarded message ---------
From: Nikhil Kshirsagar <nkshirsa@redhat.com>
Date: Sat, Oct 19, 2019 at 8:17 AM
Subject: Re: [lvm-devel] transaction_manager::new_block() couldn't allocate new block
To: LVM2 development <lvm-devel@redhat.com>


I recollect running into this one a few months ago on 0.8.5, and I think Joe fixed this in 0.8.5-1. Please try with the latest version of the pdata tools.

On Sat, 19 Oct, 2019, 6:08 AM Eric Wheeler, <lvm-devel@lists.ewheeler.net> wrote:
>
> Hello all,
>
> We are attempting to repair a thin meta volume and get the following
> error after it runs for a while:
>
> ~]# thin_repair -V
> 0.8.5
>
> ~]# thin_repair -i /dev/mapper/data-data--pool_tmeta -o /dev/data/tmeta-dest
> truncating metadata device to 4161600 4k blocks
> terminate called after throwing an instance of 'std::runtime_error'
>   what():  transaction_manager::new_block() couldn't allocate new block
> Aborted (core dumped)
>
> How can I troubleshoot this further?
>
> I'm happy to try patches against thin_repair if you would like.  I'm also
> trying thin_dump/thin_restore, so we will see how that goes---but I
> thought you might want to know in case there is a bug in thin_repair that
> could be fixed while I have the metadata in this state.
>
> --
> Eric Wheeler
>
> --
> lvm-devel mailing list
> lvm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/lvm-devel

Comment 5 nikhil kshirsagar 2019-10-23 03:08:32 UTC
Created attachment 1628147
0.8.5-1 repair tools to test with

Comment 6 bugzilla 2019-10-23 03:37:54 UTC
Here is the binary `dd` dump of the metadata for inspecting thin_repair:

https://www.duetsolution.com/static/tmeta.bin.bz2 [5GB]

Comment 7 bugzilla 2019-10-23 03:41:32 UTC
Hi Nikhil,

Thanks for jumping in.  I saw your email but haven't yet had a result to report to you.

The thin_restore might be getting farther (it's been going for hours) and I will let you know the result when it's done.

However, thin_repair still bombs with 0.8.5-1.

Is the attachment you provided different than this one?

 ~]# rpm -q device-mapper-persistent-data
device-mapper-persistent-data-0.8.5-1.el7.x86_64

-Eric

Comment 8 nikhil kshirsagar 2019-10-23 04:21:13 UTC
Hi Eric,

If repair bombed I'm not sure how you are restoring. The attachment is right, but perhaps you need a newer version of the thin repair tools, since the one I sent you worked for me for these issues. I do know there are newer releases than 0.8.5-1 and I can provide you those, but first I'll try to repair the metadata myself with the new tools.

Comment 9 Joe Thornber 2019-10-23 08:15:27 UTC
I'm currently downloading the metadata, so will be able to give more feedback later today.

But one thing that occurs to me is thin_dump/restore effectively lose any sharing that was
present between *metadata* blocks.  So if you're using a lot of snapshots that could explain
why we're running out of metadata on the restore.
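
A toy illustration of the effect described above, not the tools' actual data structures: count metadata blocks when snapshots share subtrees versus when every device gets a private copy, as happens after a dump/restore that loses sharing. The `Node` class and the tree shape are hypothetical.

```python
class Node:
    """One metadata block; children are the blocks it points at."""
    def __init__(self, children=()):
        self.children = tuple(children)

def blocks_with_sharing(roots):
    """Count distinct blocks reachable from any root (shared blocks count once)."""
    seen = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.children)
    return len(seen)

def blocks_without_sharing(roots):
    """Count blocks if every root is written out as its own private tree."""
    def size(node):
        return 1 + sum(size(child) for child in node.children)
    return sum(size(root) for root in roots)

# One origin plus 10 snapshots, all still pointing at the same 4 leaves.
leaves = [Node() for _ in range(4)]
roots = [Node(leaves) for _ in range(11)]

shared = blocks_with_sharing(roots)       # 11 roots + 4 shared leaves
unshared = blocks_without_sharing(roots)  # 11 private trees of 5 blocks each
print(shared, unshared)
```

The more snapshots share a subtree, the larger the gap between the two counts, which is how a restore can need more blocks than the original metadata used.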

Comment 11 Joe Thornber 2019-10-23 12:48:09 UTC
I've looked at the metadata now, and yes the issue is the large amount of sharing you've got in there.  I'll improve thin_dump/restore to maintain the sharing.  But it's a non-trivial change, so will take about a week.

Comment 12 bugzilla 2019-10-23 17:30:30 UTC
Thanks Joe. 

What about thin_repair?  Does it maintain sharing?

Comment 13 Joe Thornber 2019-10-24 10:40:13 UTC
No, thin_repair is just thin_dump --repair | thin_restore, but without going through the intermediate xml.
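
For reference, the two-step equivalent that goes through the intermediate xml can be scripted as below. This is a sketch only: it uses the same flags that appear earlier in this bug (`--repair`, `-i`, `-o`), and the helper name and its `run` parameter are made up for illustration.

```python
import subprocess

def repair_via_xml(src_dev, xml_path, dest_dev, run=subprocess.run):
    """Dump repaired metadata to XML, then restore that XML onto dest_dev."""
    with open(xml_path, "wb") as xml:
        # thin_dump writes the XML to stdout; capture it in the file.
        run(["thin_dump", "--repair", src_dev], stdout=xml, check=True)
    run(["thin_restore", "-i", xml_path, "-o", dest_dev], check=True)
```

Neither path preserves sharing between metadata blocks, so both can hit the same allocation failure on heavily snapshotted pools.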

Comment 14 bugzilla 2019-10-24 21:30:44 UTC
Interesting, good to know. 

Hypothetically, is there an "optimal" metadata layout that optimizes metadata sharing to minimize metadata usage? If there is, then thin_repair might be able to reduce metadata usage.

Comment 15 Joe Thornber 2019-10-25 09:37:49 UTC
I don't think so.  The sharing reflects the history of the device wrt snapshots.  When you take a fresh snapshot the
sharing is 100%, and as you break sharing in blocks I copy-on-write the minimum btree nodes possible to cope with the new entry.
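
A minimal sketch of that copy-on-write behaviour, using a two-level toy tree rather than the real btree code. `LEAF_SIZE`, `make_tree` and `cow_update` are hypothetical names.

```python
LEAF_SIZE = 4  # hypothetical number of mappings per leaf

def make_tree(mappings):
    """Build a root as a tuple of leaves; each leaf is a tuple of (virt, data) pairs."""
    items = sorted(mappings.items())
    return tuple(tuple(items[i:i + LEAF_SIZE]) for i in range(0, len(items), LEAF_SIZE))

def cow_update(root, virt, new_data):
    """Return a new root in which only the leaf holding `virt` has been copied."""
    new_leaves = []
    for leaf in root:
        if any(v == virt for v, _ in leaf):
            leaf = tuple((v, new_data if v == virt else d) for v, d in leaf)
        new_leaves.append(leaf)
    return tuple(new_leaves)

origin = make_tree({v: v for v in range(16)})  # 4 leaves
snapshot = origin                              # fresh snapshot: 100% shared
origin = cow_update(origin, 5, 1000)           # write to block 5 breaks sharing

# Leaves that are still the very same object in both trees.
shared_leaves = sum(a is b for a, b in zip(origin, snapshot))
print(shared_leaves, "of", len(snapshot), "leaves still shared")
```

Only the leaf on the path to the changed mapping is duplicated, so the sharing pattern ends up recording the write history rather than any layout that could be chosen up front.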

What would be far more useful would be to start storing ranges of mappings in the btrees rather than individual blocks.  That way we'd use far less space except in the cases where the mappings were extremely fragmented.
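
To illustrate the space argument, here is a sketch that collapses per-block mappings into ranges. The tuple layout `(virt_begin, data_begin, length)` is an assumption for the example, not the on-disk format.

```python
def coalesce(mappings):
    """Collapse (virt_block, data_block) pairs into (virt_begin, data_begin, length) runs."""
    ranges = []
    for virt, data in sorted(mappings):
        if ranges:
            vb, db, length = ranges[-1]
            # Extend the current run only if both sides continue contiguously.
            if virt == vb + length and data == db + length:
                ranges[-1] = (vb, db, length + 1)
                continue
        ranges.append((virt, data, 1))
    return ranges

contiguous = [(v, 100 + v) for v in range(1000)]  # one long run
fragmented = [(v, 2 * v) for v in range(1000)]    # no two data blocks adjacent

print(len(coalesce(contiguous)), len(coalesce(fragmented)))
```

The contiguous device needs a single entry, while the fully fragmented one still needs one entry per block, which is the worst case mentioned above.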

From the tools' point of view it would be very interesting to have a defrag tool.  I see from your metadata that you've been creating and deleting a *lot* of snapshots (transaction-id 1.3 million!), so I'd like to analyse it to see how fragmented the data has become.

Comment 16 bugzilla 2019-10-28 19:41:13 UTC
Range mappings sound like a great idea.  I see the XML has range mappings, but range_map just loops over single_map in restore_emitter.cc.

Does the on-disk metadata already support ranges?
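
The expansion described above can be sketched as follows. This is not the restore_emitter.cc code. The element and attribute names are meant to match the XML that thin_dump writes, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

def expand_mappings(device_xml):
    """Return one (origin_block, data_block, time) tuple per mapped block."""
    out = []
    for elem in ET.fromstring(device_xml):
        if elem.tag == "single_mapping":
            out.append((int(elem.get("origin_block")),
                        int(elem.get("data_block")),
                        int(elem.get("time"))))
        elif elem.tag == "range_mapping":
            origin = int(elem.get("origin_begin"))
            data = int(elem.get("data_begin"))
            time = int(elem.get("time"))
            # A range is treated as `length` consecutive single mappings.
            for i in range(int(elem.get("length"))):
                out.append((origin + i, data + i, time))
    return out

SAMPLE = """
<device dev_id="1" mapped_blocks="4" transaction="0" creation_time="0" snap_time="0">
  <range_mapping origin_begin="0" data_begin="100" length="3" time="0"/>
  <single_mapping origin_block="7" data_block="42" time="0"/>
</device>
"""

print(expand_mappings(SAMPLE))
```

The XML is compact for contiguous runs, but the restored metadata ends up with one entry per block either way.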

Comment 17 Joe Thornber 2019-10-29 11:13:44 UTC
Not at the moment.  Switching to ranges also opens up some other enhancements:

- Stop splitting IO on block boundaries, so we can allocate multiple blocks at once.
- Use a fixed small block size, eg, 4k which is great for snapshots.  If we're allocating
  multiple blocks at once then the user no longer needs to choose an optimum block size balancing
  snaps or provisioning performance.

Comment 18 Joe Thornber 2019-11-02 11:39:40 UTC
Update:

I have an experimental version of thin_dump running which dumps shared subtree fragments before dumping the devices.  The
devices then refer to these fragments rather than repeating the mappings.

Given the large metadata size I hit some performance issues, which I've got round via a mix of prefetching blocks and being careful
to not walk too deep when hitting shared nodes.

thin_dump on my dev box is running in ~7 minutes.  The output xml is about 18 gig, only slightly bigger than the
binary format.  Which demonstrates how badly we need to switch to storing ranges in the binary format.

Next week thin_restore.

Comment 19 bugzilla 2019-11-07 00:45:17 UTC
Awesome, thanks for the update Joe!

Does that mean thin_repair will be ready when thin_restore is ready, too?

Comment 20 bugzilla 2019-11-21 21:06:32 UTC
Hi Joe,

How is thin_restore coming along?

-Eric

Comment 21 Jonathan Earl Brassow 2019-11-27 15:39:48 UTC
We are making great progress on improvements, but they won't land in time for 7.8.  Finishing development and testing will require us to push to 7.9.

Comment 22 bugzilla 2019-11-27 20:27:44 UTC
Thanks Jonathan.

Besides the shared meta merging, what other improvement"s" are you working on?

Also, will this be merged into lvm2.git?  We are working with broken meta and I would like to solve this ASAP.

Comment 23 Joe Thornber 2019-12-04 15:46:57 UTC
Hi Eric,

I'm working on this full time, but it's a tough problem, particularly given the
large size of your metadata.

I have a crude version of thin_restore running.  It takes about 20 minutes to restore the large xml file
generated by dumping your metadata.

The following issues are still outstanding.

- At the moment I merge a shared subtree into the parent btree by just punching it into
  the root node.  Because there are so many shared subtrees that are simply a leaf node
  this is resulting in severely unbalanced btrees.  eg with depth ~250.  Using a merge alg
  that rebalances as it goes removes the sharing, and so defeats the purpose somewhat.

- The kernel code is being too enthusiastic about splitting nodes.  Allocating
  blocks in sequence results in leaf nodes that are 50% full.  So kernel patch incoming.
  And when the new restore is done it will pack things more tightly and you will
  recover a lot of metadata space.

- The thin_dump tool needs to aggregate shared sub trees that are always referenced together.
  This will greatly reduce the balancing issue, and is required to improve the node residency.
  For instance your metadata has ~550k shared subtrees, most of which are half full leaf nodes.
  (This is what I'm currently working on).

- I'm not restoring space maps yet.  I turned this off to speed things up when restore was taking
  several hours.  Hopefully turning it back on will not affect things, but you never know.
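
The 50% figure in the second point can be reproduced with a toy model of leaf splitting. This is a sketch that assumes a full leaf is split exactly in half, with a made-up leaf capacity. It is not the kernel btree code.

```python
def sequential_insert(num_keys, capacity):
    """Insert keys 0..num_keys-1 in order; split a full leaf in half on overflow."""
    leaves = [[]]
    for key in range(num_keys):
        last = leaves[-1]
        if len(last) == capacity:
            mid = capacity // 2
            leaves[-1] = last[:mid]  # left half stays behind and never grows again
            last = last[mid:]        # right half keeps receiving the new keys
            leaves.append(last)
        last.append(key)
    return leaves

def packed(num_keys, capacity):
    """Lay the same keys out with every leaf filled to capacity."""
    keys = list(range(num_keys))
    return [keys[i:i + capacity] for i in range(0, num_keys, capacity)]

split_leaves = sequential_insert(1000, 100)
packed_leaves = packed(1000, 100)
print(len(split_leaves), len(packed_leaves))
```

With sequential keys the left half of every split is never touched again, so all but the last leaf stay half full and the tree uses almost twice as many leaves as a tightly packed restore would.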

Comment 24 bugzilla 2019-12-06 00:53:04 UTC
Looking forward to it, I greatly appreciate all of your effort on this!

I'm especially looking forward to the packed metadata because we are getting scary-close to 100% full on the production and every time we hit 100% meta the resulting crash has always been a restore-from-backup situation.  Being able to re-pack our metadata with thin_dump/restore will help a lot.

Comment 25 bugzilla 2020-04-20 23:00:04 UTC
Hi Joe,

I hope you are well!

How is this coming along?

-Eric

Comment 26 bugzilla 2020-09-08 19:01:16 UTC
Hello Joe and Jonathan,

Has this been released? We have another system that has a severe pool metadata issue and I would like to freshen the meta to compact it with the code that Joe said he was working on.

  LV                  VG     Attr         LSize    Data%  Meta%
  data-pool           data   twi-aot---    7.98t   90.68  96.49
  [data-pool_tdata]   data   Twi-ao----    7.98t
  [data-pool_tmeta]   data   ewi-ao----   16.00g

Note that this volume is only 8 TB in size, but its metadata volume is 16 GB and 96.5% full, so we have to do something about it.


Questions:
1. Is there an upstream version of thin_restore that supports shared blocks?
2. Joe said a kernel patch is forthcoming, but I have not seen it. Is it out? If so, please attach the patch to this bug article so we can try it out. If not, when will it be available?

Please escalate as appropriate, I appreciate your help with this. Let me know if you need to see a copy of the metadata dump for inspection, but it is probably similar to the one that you have already seen.

-Eric

Comment 27 Joe Thornber 2020-09-14 14:40:13 UTC
Hi Eric,

I did a lot of work on this in the spring, but I never managed to restore your metadata successfully.  At the time I thought it was due to a bug in the newly written code.

However, I've recently been using your metadata as a test case for a rewrite of thin tools in Rust and I'm finding issues in the btree keys which could explain things (to be confirmed).

I hope to release the Rust tools this autumn.  I'm sorry progress has been so slow.

The kernel patch has not been written yet.  We're planning some modest changes to the metadata which will allow us to compress the metadata by about 5 times.  But it'll be 2021 at the earliest for that.

- Joe

Comment 28 Joe Thornber 2020-11-18 15:18:34 UTC
Cloning for RHEL8.  I don't think this will get into RHEL7 since it requires the Rust version of the tools and we won't be allowed to make such a large change this late in the RHEL7 lifetime.

