Bug 1763895
Summary: thin_restore fails with transaction_manager::new_block() couldn't allocate new block

Product:          Red Hat Enterprise Linux 7
Component:        device-mapper-persistent-data
Version:          7.7
Hardware:         Unspecified
OS:               Unspecified
Status:           CLOSED WONTFIX
Severity:         urgent
Priority:         high
Target Milestone: rc
Target Release:   ---
Reporter:         bugzilla
Assignee:         Joe Thornber <thornber>
QA Contact:       Lin Li <lilin>
CC:               agk, heinzm, jbrassow, lilin, lvm-team, msnitzer, nkshirsa, thornber
Type:             Bug
Clones:           1899131 (view as bug list)
Bug Blocks:       1899131
Last Closed:      2020-11-18 15:18:34 UTC
Description (bugzilla, 2019-10-21 20:37:35 UTC):
(In reply to bugzilla from comment #0)
> Description of problem:
>
> We are trying to thin_restore from a thin_dump xml and get the following
> error:
>
> ~]# thin_restore -i /mnt/tmp/tmeta.xml -o /dev/data/tmeta-dest
> truncating metadata device to 4161600 4k blocks
> Restoring: [====>                         ] 11%
> transaction_manager::new_block() couldn't allocate new block
>
> Version-Release number of selected component (if applicable):
>
> 0.8.5
>
> How reproducible:
>
> Very
>
> Steps to Reproduce:
> 1. lvcreate -L 16g /dev/data/tmeta-dest
> 2. wget https://www.duetsolution.com/static/tmeta.xml.bz2
> 3. bunzip2 tmeta.xml.bz2
> 4. thin_restore -i tmeta.xml -o /dev/data/tmeta-dest

No need to download 8.3G to verify the existing code will fail like you've shown. Knowing _why_ would be nice, but before getting to that: restore is likely failing due to bad metadata. How did the thin_repair phase go? What steps did you take?

We tried thin_repair first, which failed as follows:

~]# thin_repair -i /dev/mapper/data-data--pool_tmeta -o /dev/data/tmeta-dest
truncating metadata device to 4161600 4k blocks
terminate called after throwing an instance of 'std::runtime_error'
  what():  transaction_manager::new_block() couldn't allocate new block
Aborted (core dumped)

~]# /usr/sbin/thin_repair -V
0.8.5-1.el7

Since it failed, we tried a thin_dump --repair and then a thin_restore, but that failed as shown above.

Hi Eric, did you miss the email I sent out after I saw your initial mail? We ran into this issue a few months ago on 0.8.5 and it was subsequently fixed in later versions of the repair tools. Re-pasting my email below. -nikhil.
---------- Forwarded message ---------
From: Nikhil Kshirsagar <nkshirsa>
Date: Sat, Oct 19, 2019 at 8:17 AM
Subject: Re: [lvm-devel] transaction_manager::new_block() couldn't allocate new block
To: LVM2 development <lvm-devel>

I recollect running into this one a few months ago on 0.8.5, and I think Joe fixed it in 0.8.5-1. Please try with the latest version of the pdata tools.

On Sat, 19 Oct 2019, 6:08 AM Eric Wheeler, <lvm-devel.net> wrote:
>
> Hello all,
>
> We are attempting to repair a thin meta volume and get the following
> error after it runs for a while:
>
> ~]# thin_repair -V
> 0.8.5
>
> ~]# thin_repair -i /dev/mapper/data-data--pool_tmeta -o /dev/data/tmeta-dest
> truncating metadata device to 4161600 4k blocks
> terminate called after throwing an instance of 'std::runtime_error'
>   what():  transaction_manager::new_block() couldn't allocate new block
> Aborted (core dumped)
>
> How can I troubleshoot this further?
>
> I'm happy to try patches against thin_repair if you would like. I'm also
> trying thin_dump/thin_restore, so we will see how that goes---but I
> thought you might want to know in case there is a bug in thin_repair that
> could be fixed while I have the metadata in this state.
>
> --
> Eric Wheeler
>
> --
> lvm-devel mailing list
> lvm-devel
> https://www.redhat.com/mailman/listinfo/lvm-devel

Created attachment 1628147 [details]
0.8.5-1 repair tools to test with
Here is the binary `dd` dump of the metadata for inspecting thin_repair: https://www.duetsolution.com/static/tmeta.bin.bz2 [5GB]

Hi Nikhil, thanks for jumping in. I saw your email but haven't yet had a result to report. The thin_restore might be getting farther (it's been going for hours) and I will let you know the result when it's done. However, thin_repair still bombs with 0.8.5-1. Is the attachment you provided different than this one?

~]# rpm -q device-mapper-persistent-data
device-mapper-persistent-data-0.8.5-1.el7.x86_64

-Eric

Hi Eric, if repair bombed I'm not sure how you are restoring. The attachment is right, but perhaps you need a newer version of the thin repair tools, since the one I sent you worked for me for these issues. I do know there are newer releases than 0.8.5-1 and I can provide you those, but first I'll try to repair the metadata myself with the new tools.

I'm currently downloading the metadata, so will be able to give more feedback later today. But one thing that occurs to me is that thin_dump/restore effectively lose any sharing that was present between *metadata* blocks. So if you're using a lot of snapshots, that could explain why we're running out of metadata space on the restore.

I've looked at the metadata now, and yes, the issue is the large amount of sharing you've got in there. I'll improve thin_dump/restore to maintain the sharing, but it's a non-trivial change, so it will take about a week.

Thanks Joe. What about thin_repair? Does it maintain sharing?

No, thin_repair is just thin_dump --repair | thin_restore, but without going through the intermediate xml.

Interesting, good to know.
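Joe's point that dump/restore lose sharing between metadata blocks can be sketched with a toy model (illustration only: the names and structure here are invented and this is not the real dm-thin on-disk format). A leaf node shared by an origin and its snapshots is stored once on disk, but a naive dump walks each device's tree independently and re-emits every node it visits:

```python
def shared_node_count(devices):
    """On-disk cost when shared nodes are stored once (counted by identity)."""
    return len({id(node) for dev in devices for node in dev})

def restored_node_count(devices):
    """Cost after a dump/restore that forgets sharing: every reference
    becomes its own copy."""
    return sum(len(dev) for dev in devices)

# One origin device with 4 leaf nodes, plus 99 fresh snapshots that
# still share every leaf with the origin (100% sharing).
leaves = [object() for _ in range(4)]
devices = [list(leaves) for _ in range(100)]

print(shared_node_count(devices))    # 4 nodes stored with sharing
print(restored_node_count(devices))  # 400 nodes once sharing is lost
```

With heavy snapshot use the multiplier is the snapshot count, which is consistent with a restore of 16 GiB of shared metadata overflowing the destination device.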
Hypothetically, is there an "optimal" metadata layout that optimizes sharing to minimize metadata usage? If there is, then thin_repair might be able to reduce metadata usage.

I don't think so. The sharing reflects the history of the device wrt snapshots. When you take a fresh snapshot the sharing is 100%, and as you break sharing in blocks I copy-on-write the minimum btree nodes possible to cope with the new entry.

What would be far more useful would be to start storing ranges of mappings in the btrees rather than individual blocks. That way we'd use far less space, except in the cases where the mappings were extremely fragmented. From the tools' point of view it would be very interesting to have a defrag tool. I see from your metadata that you've been creating and deleting a *lot* of snapshots (transaction-id 1.3 million!), so I'd like to analyse it to see how fragmented the data has become.

Range mappings sound like a great idea. I see the XML has range mappings, but range_map just loops over single_map in restore_emitter.cc. Does the on-disk metadata already support ranges?

Not at the moment. Switching to ranges also opens up some other enhancements:

- Stop splitting IO on block boundaries, so we can allocate multiple blocks at once.
- Use a fixed small block size, e.g. 4k, which is great for snapshots. If we're allocating multiple blocks at once, the user no longer needs to choose an optimum block size that trades snapshot performance against provisioning performance.

Update: I have an experimental version of thin_dump running which dumps shared subtree fragments before dumping the devices. The devices then refer to these fragments rather than repeating the mappings. Given the large metadata size I hit some performance issues, which I've got round via a mix of prefetching blocks and being careful not to walk too deep when hitting shared nodes. thin_dump on my dev box runs in ~7 minutes. The output xml is about 18 gig, only slightly bigger than the binary format.
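The range-mapping idea discussed above amounts to run-length encoding the block mappings. A minimal sketch, under assumed representations (a mapping is a (virtual_block, data_block) pair; a range is (virtual_begin, data_begin, length), roughly mirroring the XML's range elements):

```python
def to_ranges(mappings):
    """Collapse single-block mappings into (virt_begin, data_begin, length)
    ranges.  A mapping extends the current run only when both its virtual
    and its data block number are contiguous with it."""
    ranges = []
    for vblock, dblock in sorted(mappings):
        if ranges:
            v0, d0, n = ranges[-1]
            if vblock == v0 + n and dblock == d0 + n:
                ranges[-1] = (v0, d0, n + 1)
                continue
        ranges.append((vblock, dblock, 1))
    return ranges

# 1000 contiguous single-block mappings collapse to one range entry,
# while fully fragmented mappings gain nothing:
contiguous = [(v, v + 7) for v in range(1000)]
fragmented = [(v, 2 * v) for v in range(1000)]
print(len(to_ranges(contiguous)))  # 1
print(len(to_ranges(fragmented)))  # 1000
```

This is why Joe notes ranges would use far less space "except in the cases where the mappings were extremely fragmented", and why a defrag tool becomes interesting: defragmentation converts the second case into the first.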
This demonstrates how badly we need to switch to storing ranges in the binary format. Next week: thin_restore.

Awesome, thanks for the update Joe! Does that mean thin_repair will be ready when thin_restore is ready, too?

Hi Joe, how is thin_restore coming along? -Eric

We are making great progress on improvements, but they won't land in time for 7.8. To finish development and testing will require us to push to 7.9.

Thanks Jonathan. Besides the shared metadata merging, what other improvements are you working on? Also, will this be merged into lvm2.git? We are working with broken metadata and I would like to solve this ASAP.

Hi Eric, I'm working on this full time, but it's a tough problem, particularly given the large size of your metadata. I have a crude version of thin_restore running; it takes about 20 minutes to restore the large xml file generated by dumping your metadata. The following issues are still outstanding:

- At the moment I merge a shared subtree into the parent btree by just punching it into the root node. Because there are so many shared subtrees that are simply a leaf node, this results in severely unbalanced btrees, e.g. with depth ~250. Using a merge algorithm that rebalances as it goes removes the sharing, and so defeats the purpose somewhat.
- The kernel code is being too enthusiastic about splitting nodes: allocating blocks in sequence results in leaf nodes that are 50% full, so a kernel patch is incoming. When the new restore is done it will pack things more tightly and you will recover a lot of metadata space.
- The thin_dump tool needs to aggregate shared subtrees that are always referenced together. This will greatly reduce the balancing issue, and is required to improve node residency. For instance, your metadata has ~550k shared subtrees, most of which are half-full leaf nodes. (This is what I'm currently working on.)
- I'm not restoring space maps yet. I turned this off to speed things up when restore was taking several hours.
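The leaf-residency point above can be put into rough numbers. All figures here are assumptions for illustration; ENTRIES_PER_LEAF is not the real dm-thin leaf capacity:

```python
# Back-of-envelope sketch: how average leaf residency translates into
# metadata blocks.  ENTRIES_PER_LEAF is an assumed capacity.
ENTRIES_PER_LEAF = 256

def leaves_needed(n_mappings, residency):
    """Leaf blocks needed to hold n_mappings at a given average fill factor."""
    per_leaf = int(ENTRIES_PER_LEAF * residency)
    return -(-n_mappings // per_leaf)   # ceiling division

n = 10_000_000
half_full = leaves_needed(n, 0.5)   # as after sequential allocation
packed = leaves_needed(n, 1.0)      # as a tightly packing restore aims for
print(half_full, packed)            # 78125 39063
```

Whatever the real per-leaf capacity, 50% average residency roughly doubles the leaf-block count, which is why packing nodes tightly on restore recovers so much metadata space.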
Hopefully turning space maps back on will not affect things, but you never know.

Looking forward to it; I greatly appreciate all of your effort on this! I'm especially looking forward to the packed metadata, because we are getting scary-close to 100% full in production, and every time we hit 100% metadata the resulting crash has been a restore-from-backup situation. Being able to re-pack our metadata with thin_dump/restore will help a lot.

Hi Joe, I hope you are well! How is this coming along? -Eric

Hello Joe and Jonathan, has this been released? We have another system that has a severe pool metadata issue and I would like to freshen the metadata to compact it with the code that Joe said he was working on.

  LV                 VG    Attr        LSize   Data%   Meta%
  data-pool          data  twi-aot---   7.98t  90.68   96.49
  [data-pool_tdata]  data  Twi-ao----   7.98t
  [data-pool_tmeta]  data  ewi-ao----  16.00g

Note that this volume is only 8tb in size, but its metadata volume is 16gb and 96.5% full, so we have to do something about it. Questions:

1. Is there an upstream version of thin_restore that supports shared blocks?
2. Joe said a kernel patch was forthcoming, but I have not seen it. Is it out? If so, please attach the patch to this bug so we can try it out. If not, when will it be available?

Please escalate as appropriate; I appreciate your help with this. Let me know if you need to see a copy of the metadata dump for inspection, but it is probably similar to the one that you have already seen. -Eric

Hi Eric, I did a lot of work on this in the spring, but I never managed to restore your metadata successfully. At the time I thought it was due to a bug in the newly written code. However, I've recently been using your metadata as a test case for a rewrite of the thin tools in Rust, and I'm finding issues in the btree keys which could explain things (to be confirmed). I hope to release the Rust tools this autumn. I'm sorry progress has been so slow. The kernel patch has not been written yet.
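For a rough sense of the headroom in the lvs output above (assuming tmeta is exactly 16 GiB and the reported Meta% is exact):

```python
# Remaining metadata space implied by a 16 GiB tmeta at 96.49% used.
tmeta_bytes = 16 * 2**30
free_bytes = tmeta_bytes * (1 - 0.9649)
free_4k_blocks = int(free_bytes // 4096)   # dm-thin metadata uses 4k blocks
print(round(free_bytes / 2**20))           # ~575 MiB of metadata headroom left
```

That margin shrinks with every snapshot taken, which is why Eric describes the pool as "scary-close" to the hard failure at 100%.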
We're planning some modest changes to the metadata which will allow us to compress it by about 5 times, but it'll be 2021 at the earliest for that. - Joe

Cloning for RHEL8. I don't think this will get into RHEL7, since it requires the Rust version of the tools and we won't be allowed to make such a large change this late in the RHEL7 lifetime.