Bug 1532071 - Out of space on Thin LVM, thin_repair failed to recover properly
Summary: Out of space on Thin LVM, thin_repair failed to recover properly
Keywords:
Status: ASSIGNED
Alias: None
Product: LVM and device-mapper
Classification: Community
Component: lvm2
Version: 2.02.116
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Joe Thornber
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-07 21:16 UTC by Alex
Modified: 2023-08-10 15:40 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
rule-engine: lvm-technical-solution?
rule-engine: lvm-test-coverage?


Attachments
Tarfile containing a compressed image of the meta device and vgcfgbackup file (6.69 MB, application/x-tar)
2018-01-07 21:16 UTC, Alex
Requested metadata and cfg (19.42 MB, application/x-bzip)
2018-01-11 22:31 UTC, terryr
meta1, meta2, and /etc/lvm (2.07 MB, application/x-rar)
2018-01-11 23:14 UTC, terryr
Kernel logs (276.86 KB, application/x-bzip)
2018-01-16 22:24 UTC, Alex
My condensed version of the kernel logs and story. (4.27 KB, text/plain)
2018-01-16 22:28 UTC, Alex

Description Alex 2018-01-07 21:16:48 UTC
Created attachment 1378229 [details]
Tarfile containing a compressed image of the meta device and vgcfgbackup file

Description of problem: 
I took a disk image prior to any repair.
I attempted a repair on 2.02.116 with thin_repair 0.3.1 on kernel 4.4.98, which failed.

Then I restored from my image and tried again from a bootable archlinux USB key running: 
LVM Version: 2.02.176(2) (2017-11-03)
thin_repair --version: 0.7.5
Linux archiso 4.13.12-1-ARCH #1 SMP PREEMPT Wed Nov 8 11:54:06 CET 2017 x86_64 GNU/Linux

And I got the same failure.

    Executing: /usr/bin/thin_check --clear-needs-check-flag /dev/mapper/pve-data_tmeta
examining superblock
examining devices tree
examining mapping tree
  missing all mappings for devices: [0, -]
    value size mismatch: expected 8, but got 24. This is not the btree you are looking for. (block 70)
    /usr/bin/thin_check failed: 1
  Check of pool pve/data failed (status:1). Manual repair required!
    Removing pve-data_tdata (254:3)
    Removing pve-data_tmeta (254:2)
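
(For reference, the standard repair path that this "Manual repair required!" message points at; a minimal sketch, assuming the pool is pve/data and that a spare or scratch metadata LV is available:)

    # One-step repair: builds new metadata from the spare LV, runs
    # thin_repair underneath, and leaves the damaged metadata behind
    # (conventionally as pve/data_meta0) for later inspection.
    lvconvert --repair pve/data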


Version-Release number of selected component (if applicable):


How reproducible: Unknown


Steps to Reproduce:
1. Run out of space.
2. Reboot.
3. Repeat same thing that caused you to run out of space again.
4. Reboot.
5. LVM fails to come up, and thin_repair can't fix it, even when given a 300 MB LV to repair into (a sketch of that invocation follows this list).
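
(A minimal sketch of the manual repair in step 5; the LV names here are assumptions, with the damaged metadata assumed to have been swapped out of the pool and visible as /dev/pve/data_meta0:)

    # Create a scratch LV to repair into, then copy/repair the metadata across:
    lvcreate -L 300M -n repair_meta pve
    thin_repair -i /dev/pve/data_meta0 -o /dev/pve/repair_meta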

Actual results: 
No mappings for any thin LV; the data is inaccessible.

Expected results:
The data to be recoverable, at least.

Additional info:
Proxmox VE 4.4-20/2650b7b5
Linux atomic 4.4.98-2-pve #1 SMP PVE 4.4.98-101 (Mon, 18 Dec 2017 13:36:02 +0100) x86_64 GNU/Linux
LVM 2.02.116
thin_repair 0.3.1

Comment 1 Alex 2018-01-07 21:21:37 UTC
I forgot to add the checksum:
SHA512(lvm-meta-cfgbackup_atomic.tar)= f10730f50ae9c51bbf35c769d22640d1ca6764fd29291b80d6f581bab8f674ec387bc61b5678a585919cf611856ad1bd414fff85c527448816924c9c45035b9b
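
(The digest above is in OpenSSL's output format, so it can be checked against the downloaded attachment with:)

    openssl dgst -sha512 lvm-meta-cfgbackup_atomic.tar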

Comment 2 terryr 2018-01-11 22:31:24 UTC
Created attachment 1380237 [details]
Requested metadata and cfg

Similar circumstances to Alex's: an unrecognized overprovisioned disk, and then a power outage causing corruption.

Comment 3 terryr 2018-01-11 23:14:51 UTC
Created attachment 1380239 [details]
meta1, meta2, and /etc/lvm

Comment 4 Alex 2018-01-16 22:24:53 UTC
Created attachment 1382185 [details]
Kernel logs

Comment 5 Alex 2018-01-16 22:26:17 UTC
[Excerpts from the first time running out of space (and successfully rebooting, with everything ending up fine).]

Dec 23 19:53:56 atomic lvm[1109]: Thin pve-data-tpool is now 95% full.
Dec 23 20:56:32 atomic kernel: [5126470.856309] device-mapper: thin: 251:4: metadata operation 'dm_pool_alloc_data_block' failed: error = -28
Dec 23 20:56:32 atomic kernel: [5126470.856350] device-mapper: thin: 251:4: aborting current metadata transaction
Dec 23 20:56:33 atomic kernel: [5126471.462387] device-mapper: thin: 251:4: switching pool to read-only mode
Dec 23 20:57:00 atomic kernel: [5126499.188962] Buffer I/O error on dev dm-23, logical block 46016, lost async page write
Dec 23 20:57:31 atomic kernel: [5126529.684233] buffer_io_error: 23 callbacks suppressed
Dec 23 20:57:53 atomic kernel: [5126551.427673] buffer_io_error: 164 callbacks suppressed
<...>
Dec 23 21:17:31 atomic kernel: [5127730.237332] buffer_io_error: 19462 callbacks suppressed
Dec 23 21:17:31 atomic kernel: [5127730.237744] Buffer I/O error on dev dm-23, logical block 22480714, lost async page write
Dec 23 21:17:34 atomic kernel: [5127732.988048] VFS: Dirty inode writeback failed for block device dm-23 (err=-5).
<...>
reboot
<...>
Dec 23 21:46:29 atomic kernel: [    0.918952] device-mapper: uevent: version 1.0.3
Dec 23 21:46:29 atomic kernel: [    0.919008] device-mapper: ioctl: 4.34.0-ioctl (2015-10-28) initialised: dm-devel
<...>
Dec 23 21:46:31 atomic lvm[1059]: Thin pve-data-tpool is now 98% full.
Dec 23 21:46:29 atomic kernel: [    9.077473] Adding 7340028k swap on /dev/mapper/pve-swap.  Priority:-1 extents:1 across:7340028k FS
Dec 23 21:46:29 atomic kernel: [   17.543264] device-mapper: thin: Data device (dm-3) discard unsupported: Disabling discard passdown.
<...>



(At this point everything seems fine and I copy the same file over again, run out of space again and...)
<...>
Dec 23 22:11:07 atomic kernel: [ 1507.448976] device-mapper: thin: 251:4: metadata operation 'dm_pool_alloc_data_block' failed: error = -28
Dec 23 22:11:07 atomic kernel: [ 1507.448995] device-mapper: thin: 251:4: aborting current metadata transaction
Dec 23 22:11:07 atomic kernel: [ 1507.505452] device-mapper: thin: 251:4: switching pool to read-only mode
Dec 23 22:11:08 atomic kernel: [ 1507.601996] attempt to access beyond end of device
Dec 23 22:11:08 atomic kernel: [ 1507.601999] dm-2: rw=0, want=2800344, limit=180224
Dec 23 22:11:08 atomic kernel: [ 1507.602002] device-mapper: thin: __process_bio_read_only: dm_thin_find_block() failed: error = -5
<..thousand lines later..>
Dec 23 22:11:08 atomic kernel: [ 1507.617040] Buffer I/O error on dev dm-22, logical block 24090176, async page read
<...>
Dec 23 22:11:21 atomic kernel: [ 1521.163384] device-mapper: thin: dm_thin_get_highest_mapped_block returned -5
Dec 23 22:11:21 atomic kernel: [ 1521.163419] device-mapper: thin: dm_thin_get_highest_mapped_block returned -15
<...>
Dec 23 22:11:31 atomic kernel: [ 1530.602487] device-mapper: btree spine: node_check failed: blocknr 0 != wanted 21938
Dec 23 22:11:31 atomic kernel: [ 1530.602523] device-mapper: block manager: btree_node validator check failed for block 21938
Dec 23 22:11:31 atomic kernel: [ 1530.602568] device-mapper: btree spine: node_check failed: blocknr 0 != wanted 21938
Dec 23 22:11:31 atomic kernel: [ 1530.602600] device-mapper: block manager: btree_node validator check failed for block 21938
Dec 23 22:11:31 atomic kernel: [ 1530.602634] device-mapper: btree spine: node_check failed: blocknr 0 != wanted 21938
<...>
Dec 23 22:11:31 atomic kernel: [ 1531.523072] device-mapper: thin: dm_thin_get_highest_mapped_block returned -22
<...>
Dec 23 22:20:17 atomic kernel: [ 2057.082854] device-mapper: thin: __process_bio_read_only: dm_thin_find_block() failed: error = -5
Dec 23 22:20:17 atomic kernel: [ 2057.082893] Buffer I/O error on dev dm-14, logical block 1108686, async page read
Dec 23 22:21:24 atomic kernel: [    0.000000] Initializing cgroup subsys cpuset
(reboot)
<...>
Dec 23 22:21:24 atomic lvm[425]: Check of pool pve/data failed (status:1). Manual repair required!
(Now it seems to be FUBAR'd... I took a full disk image at this point. What follows is an excerpt from thin_check.)
examining mapping tree
  missing all mappings for devices: [0, -]
    value size mismatch: expected 8, but got 24. This is not the btree you are looking for. (block 70)
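
(In these logs, error = -28 is -ENOSPC, -5 is -EIO, and -22 is -EINVAL. When thin_check rejects the mapping tree like this, one way to see how much is still salvageable is thin_dump's repair mode; a minimal sketch, again assuming the damaged metadata has been made visible as /dev/pve/data_meta0:)

    # Dump whatever metadata is still readable to XML, skipping damage:
    thin_dump --repair -o meta.xml /dev/pve/data_meta0
    # Rebuild a fresh metadata device from that dump:
    thin_restore -i meta.xml -o /dev/pve/repair_meta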

Comment 6 Alex 2018-01-16 22:28:27 UTC
Created attachment 1382186 [details]
My condensed version of the kernel logs and story.

Comment 7 Joe Thornber 2019-10-07 14:09:32 UTC
thin_repair still fails to recover any volumes, even with the recent changes (0.8.5).
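
(A sketch of how the attached metadata image can be re-tested as newer tool versions land; the file names and output size here are assumptions, and the thin tools operate on plain files as well as devices:)

    thin_check meta.img                   # reproduce the reported failure
    fallocate -l 300M repaired.img        # thin_repair needs a pre-sized output file
    thin_repair -i meta.img -o repaired.img
    thin_check repaired.img               # verify the repaired metadata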

