Bug 1585670 - LVM cache metadata superblock destroyed after hard-reset?
Summary: LVM cache metadata superblock destroyed after hard-reset?
Keywords:
Status: ASSIGNED
Alias: None
Product: LVM and device-mapper
Classification: Community
Component: lvm2
Version: 2.02.173
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Joe Thornber
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-06-04 11:40 UTC by devurandom
Modified: 2018-06-14 02:23 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
rule-engine: lvm-technical-solution?
rule-engine: lvm-test-coverage?


Attachments
output of several lvm commands (5.80 KB, text/plain)
2018-06-04 11:40 UTC, devurandom
head -c 4890783 cache_cmeta | xz > cache_cmeta.first-4890783-bytes.xz (875.59 KB, application/x-xz)
2018-06-05 10:02 UTC, devurandom
/etc/lvm/backup/ernie (3.91 KB, text/plain)
2018-06-06 09:58 UTC, devurandom
head -c 1M nvme0n1.raw | xz > nvme0n1.raw.first-1048576-bytes.xz (4.48 KB, application/x-xz)
2018-06-07 03:45 UTC, devurandom

Description devurandom 2018-06-04 11:40:07 UTC
Created attachment 1447403 [details]
output of several lvm commands

Description of problem:

I have an issue regaining access to my data.  The error messages are:
# lvchange -ay ernie/system
  Check of pool ernie/cache failed (status:1). Manual repair required!
# lvconvert --repair ernie/system
bad checksum in superblock
  Repair of cache metadata volume of cache ernie/system failed (status:1). 
Manual repair required!

I was advised to run `lvchange -ay vg/lv_cmeta` with LVM 2.02.177 or later,
but that only resulted in an error message:
# lvchange -ay ernie/cache_cmeta
  Operation not permitted on hidden LV ernie/cache_cmeta.

I tried again using LVM 2.02.178-rc1, which was successful, but caused the
following `dmesg` output:
[  353.872764] device-mapper: cache metadata: sb_check failed: blocknr 7640232: wanted 0
[  353.873415] device-mapper: block manager: superblock validator check failed for block 0
[  353.874070] device-mapper: cache metadata: couldn't read lock superblock
[  353.875259] device-mapper: table: 253:5: cache: Error creating metadata object
[  353.875924] device-mapper: ioctl: error adding target to table
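
The "blocknr 7640232: wanted 0" message can be cross-checked against a copy of the metadata volume. The following is only a minimal sketch: it assumes that the superblock stores its own block number as a 64-bit little-endian value at byte offset 8 (after a 32-bit checksum and 32-bit flags), which is how I read struct cache_disk_superblock in the kernel sources. The function name read_blocknr is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the "blocknr" field of a dm-cache metadata superblock image.
# ASSUMPTION: blocknr is a 64-bit little-endian integer at byte offset 8,
# following a 32-bit checksum and 32-bit flags. Verify against your kernel.
read_blocknr() {
    img=$1
    value=0
    shift_by=0
    # od prints the 8 bytes as unsigned decimals, lowest address first,
    # so each byte is shifted 8 bits further than the previous one.
    for byte in $(od -A n -t u1 -j 8 -N 8 "$img"); do
        value=$(( value + (byte << shift_by) ))
        shift_by=$(( shift_by + 8 ))
    done
    echo "$value"
}
```

Run against a `dd` copy of cache_cmeta, a healthy superblock should print 0 under this assumption, because the superblock lives in block 0. Any other value means that the first block holds something other than a valid superblock.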


My most important questions are:

* What exactly is broken?
* What does "manual repair" mean in detail?
* Is there some way to recover the cache?  Or is it at least possible to 
  uncache the LV forcibly, to hopefully recover the data on the origin LV /
  allow it to be mounted without the cache?
* What is your recommendation to minimise data loss in this situation?


Version-Release number of selected component (if applicable):

The live system I am trying to use for recovery is using Fedora 28 with Linux
4.16.3-301.fc28.
# lvm version
  LVM version 2.02.177(2)
  Library version 1.02.146
  Driver version 4.37.0

The LVM binary that allowed me to activate the broken metadata volume was
compiled statically on a different machine, but runs on the same Fedora 28 live
system:
# LD_LIBRARY_PATH=$PWD/lib:$PWD/usr/lib ./bin/lvm version
  LVM version:     2.02.178(2)-rc1 (2018-05-24)
  Library version: 1.02.147-rc1 (2018-05-24)
  Driver version:  4.37.0
  Configuration:   ./configure --enable-static_link --disable-use-lvmetad --disable-use-lvmpolld --disable-use-lvmlockd --disable-valgrind-pool PKG_CONFIG_PATH=/home/dschridde/lvm-rescue-operation-20180530/LVM2.2.02.178-rc1/../bundle/usr/lib/pkgconfig PKG_CONFIG_SYSROOT_DIR=/home/dschridde/lvm-rescue-operation-20180530/LVM2.2.02.178-rc1/../bundle LIBS=-L/home/dschridde/lvm-rescue-operation-20180530/LVM2.2.02.178-rc1/../bundle/lib -L/home/dschridde/lvm-rescue-operation-20180530/LVM2.2.02.178-rc1/../bundle/usr/lib -luuid -lpthread -lm LDFLAGS=-Wl,--as-needed

The system on which the cache broke was running Linux 4.16.7-gentoo.
Executing the `lvm` command from its initrd on the Fedora 28 live system
reports:
# LD_LIBRARY_PATH=$PWD/lib64 ./bin/lvm version
  LVM version: 2.02.173(2)
  Library version: 1.02.142
  Driver version: 4.37.0


Additional info:

For a full log, including the output of `pvdisplay`, `vgdisplay` and
`lvdisplay -a`, please see attached log file.  If more information is
necessary, please ask.

What led to this situation:

I am using one of the infamous AMD Ryzen 2400G CPUs with an AMD B350 chipset,
which suffers from random lockups related to CPU C-states [1].  Because of this
I had to reboot (ctrl+alt+del) and hard-reset the system several times, after
which the system LV could no longer be activated.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=196683

Relation to similar problems:

I already found older mailing list posts [2,3] describing a similar
scenario.  At the time, recovery was impossible for the user, but it seems
that the situation has improved somewhat since then.  However, I am still stuck:
`lvconvert --repair` asks me to repair "manually", `cache_dump --repair`
cannot operate on non-active LVs, and LVM refuses to activate the LV
as long as it has not been repaired.

[2]: https://www.redhat.com/archives/linux-lvm/2016-December/msg00013.html
[3]: https://www.redhat.com/archives/linux-lvm/2015-August/msg00008.html

Comment 1 devurandom 2018-06-04 11:47:05 UTC
P.S. I was asked to attach an XZ compressed copy of the `cache_cmeta` LV to this ticket, but it appears to contain parts of my personal files (I am looking at it with `hexdump -C`).

Comment 2 Zdenek Kabelac 2018-06-04 15:02:20 UTC
(In reply to devurandom from comment #1)
> P.S. I was asked to attach an XZ compressed copy of the `cache_cmeta` LV to
> this ticket, but it appears to contain parts of my personal files (I am
> looking at it with `hexdump -C`).

Hi

Are you sure you were actually looking at the 'right' volume ?

Normally metadata devices do not contain any user data (unless it is left over from a previous use of the given disk space, e.g. for a filesystem).

Your metadata device is just 48MB big, so maybe you could 'visually' recognize the end of the 'btree' data layout and cut away your 'private' content?

It's nearly impossible to deduce the error type from this report ATM.


Also note: there is a tool, 'cache_writeback', that should be able to flush cache data to the origin device offline.


lvm2 version 178 should be able to activate your _cmeta device 'standalone' (in read-only mode) for easier access to its content.

Comment 3 devurandom 2018-06-05 10:02:55 UTC
Created attachment 1447788 [details]
head -c 4890783 cache_cmeta | xz > cache_cmeta.first-4890783-bytes.xz

Hi Zdenek!

(In reply to Zdenek Kabelac from comment #2)
> (In reply to devurandom from comment #1)
> > P.S. I was asked to attach an XZ compressed copy of the `cache_cmeta` LV to
> > this ticket, but it appears to contain parts of my personal files (I am
> > looking at it with `hexdump -C`).
> 
> Are you sure you were actually looking at the 'right' volume ?

The cache_cmeta LV (and the file I copied from that using `dd`) is 48MB in size.  There was also a lvol0_spare LV of the same size, but I was not allowed to activate it (operation not permitted on hidden LV).  Hence this is definitely the metadata volume.

> Normally metadata devices do not contain any user data (unless it is left over
> from a previous use of the given disk space, e.g. for a filesystem).

Marian Csontos replied with a similar argument on the linux-lvm mailing list:

> It definitely should not. _cdata is where fragments of your data are
> stored, and _cmeta contains only metadata (e.g. counters and references 
> to cdata.)
> 
> If there really are fragments of data in _cmeta LV, something must have 
> gone wrong elsewhere.

The cache device is a Samsung 960 Evo M.2 2280 NVMe with a capacity of 250GB that was never used as anything but an LVM cache.  I would assume that all blocks that are no longer used would be discarded and thus cleared / set to zero?  Or do SSDs behave just like HDDs in this regard?

> Your metadata device is just 48MB big, so maybe you could 'visually'
> recognize the end of the 'btree' data layout and cut away your 'private'
> content?

The contents of cache_cmeta are very regular (a pattern of mostly 01 00 .. .. .. 00 00 00 with occasional zero blocks) up to byte 004aa000 / 4890624, where the first block of seemingly unstructured data is located.

I have attached a few bytes more (up to byte 004aa09f / 4890783) so you maybe recognize the start of that block, if it is actually part of the LVM cache metadata structure.
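
For reference, this is how the hexadecimal offsets from `hexdump -C` map to decimal byte counts and metadata block numbers. It is only a sketch: the 4096-byte block size is my assumption about the cache metadata format, and the helper names are made up.

```shell
#!/bin/sh
# Sketch: convert a hexdump offset to a decimal byte offset and to the index
# of the metadata block that contains it.
# ASSUMPTION: the cache metadata uses 4096-byte blocks.
BLOCK_SIZE=4096

# Print the decimal value of a hexadecimal offset given without "0x".
hex_to_dec() {
    echo $(( 0x$1 ))
}

# Print the index of the block that contains the given hexadecimal offset.
offset_to_block() {
    echo $(( 0x$1 / BLOCK_SIZE ))
}
```

`hex_to_dec 004aa000` prints 4890624 and `offset_to_block 004aa000` prints 1194, so under this assumption the unstructured data starts exactly on a block boundary. Note that `head -c N` copies the N bytes at offsets 0 to N-1.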

> Also note - there is tool   'cache_writeback'   - that should be able to
> offline flush cache data to origin device.

But would that not require the cache to be healthy?  I.e. would it work on a cache LV that has a "bad superblock", where even `lvconvert --repair` bails out?

> lvm2 version 178 should be able to activate your _cmeta device 'standalone'
> (in read-only mode) for easier access to its content.

Yes, I ran `lvchange -ay ernie/cache_cmeta` with 178, and that was successful insofar as it allowed me to take a backup of the hidden LVs, but it also produced some error messages in dmesg (see above).

Comment 4 Zdenek Kabelac 2018-06-05 11:08:00 UTC
Passing to Joe.

The cache tools need enhanced support for this (similar to what has been done for the thin tools).

Comment 5 Joe Thornber 2018-06-05 16:15:54 UTC
Hi,

I've spent the afternoon looking at your metadata.  I can see that there are
some of our btrees on there but the data doesn't make much sense.

The block at the start of the device, where the superblock should be, contains some structured data, but it's definitely not a superblock.

To proceed I'd need the full contents of the metadata device, and to know the size of the SSD volume, and the origin device being cached.

Thanks,

- Joe

Comment 6 devurandom 2018-06-05 17:02:56 UTC
(In reply to Joe Thornber from comment #5)
> To proceed I'd need the full contents of the metadata device, and to know
> the size of the SSD volume, and the origin device being cached.

How should I send you this data?  Does Red Hat have a service where I could upload the files, knowing that they will be deleted once this issue is resolved?

Comment 7 Joe Thornber 2018-06-05 17:50:23 UTC
The metadata shouldn't have any of your data on it (you said you used fresh disks).

Probably the easiest thing is if you share it somehow (dropbox?) and I'll pull it.

Comment 8 devurandom 2018-06-06 06:59:35 UTC
(In reply to Joe Thornber from comment #5)
> To proceed I'd need the full contents of the metadata device, and to know
> the size of the SSD volume, and the origin device being cached.

You should be able to find that information in attachment #1447403 [details].

(In reply to Joe Thornber from comment #7)
> The metadata shouldn't have any of your data on it (you said you used fresh
> disks).

I can definitely see the content of emails there.  (See also comment #1.)

> Probably the easiest thing is if you share it somehow (dropbox?) and I'll
> pull it.

Please have a look at your mailbox.

Comment 9 Joe Thornber 2018-06-06 08:54:07 UTC
Nothing in my mail box (thornber@redhat.com).

Comment 10 Zdenek Kabelac 2018-06-06 08:58:17 UTC
Can you also please attach tar.gz of  /etc/lvm/archive  ?

Comment 11 devurandom 2018-06-06 09:21:57 UTC
(In reply to Joe Thornber from comment #9)
> Nothing in my mail box (thornber@redhat.com).

The email went out at 06:59 UTC and has not bounced so far.  Maybe it ended up in your spam folder?

On a side note: Do you use PGP?

(In reply to Zdenek Kabelac from comment #10)
> Can you also please attach tar.gz of  /etc/lvm/archive  ?

Sorry, I can't, because the system partition is "behind" that cache with the bad superblock.

Comment 12 Zdenek Kabelac 2018-06-06 09:44:47 UTC
 
> (In reply to Zdenek Kabelac from comment #10)
> > Can you also please attach tar.gz of  /etc/lvm/archive  ?
> 
> Sorry, I can't, because the system partition is "behind" that cache with the
> bad superblock.


In that case please extract the first 1MiB of your origin PV, e.g. with dd:

dd if=/dev/sda of=/tmp/extracted bs=1M count=1

then please compress and attach.

It should be giving 'nearly' same info.

Comment 13 devurandom 2018-06-06 09:58:27 UTC
Created attachment 1448242 [details]
/etc/lvm/backup/ernie

etckeeper FTW, I found a copy of /etc/lvm/backup/.  The version committed by etckeeper is from shortly (~7h) before I reported the crash on the linux-lvm ML.  /etc/lvm/archive/ did not exist when etckeeper created the backup.

If you still want the first 1MiB of the PV, please ping me again.

Comment 14 Zdenek Kabelac 2018-06-06 10:09:34 UTC
(In reply to devurandom from comment #13)
> Created attachment 1448242 [details]
> /etc/lvm/backup/ernie
> 
> etckeeper FTW, I found a copy of /etc/lvm/backup/  The version committed by
> etckeeper is from shortly (~7h) before I reported the crash on the linux-lvm
> ML.  /etc/lvm/archive/ did not exist when etckeeper created the backup.
> 
> If you still want the first 1MiB of the PV, please ping me again.

Yes please, grab it and attach.

We would like to see historical work with VG metadata.



It's very unclear how any of your user data might have landed on _cmeta device.

Since you mentioned issues with Ryzen - it could be result of some page-cache breakage??

Comment 15 devurandom 2018-06-07 03:45:10 UTC
Created attachment 1448588 [details]
head -c 1M nvme0n1.raw | xz > nvme0n1.raw.first-1048576-bytes.xz

(In reply to Zdenek Kabelac from comment #14)
> Yes please, grab it and attach.

Please find the first 1MiB of the SSD attached.  nvme0n1.raw was obtained on the broken system using `dd`.

> It's very unclear how any of your user data might have landed on _cmeta
> device.

I am curious, too.  Could it be that the NVMe SSD reused a written-to-but-then-discarded page of the cache_cdata volume in a still-unused part of the cache_cmeta volume?  I am not exactly sure how these devices are working internally and what guarantees they make to the OS...  The other option that came to my mind was that the CPU really writes garbage to the disks...

> Since you mentioned issues with Ryzen - it could be result of some
> page-cache breakage??

Yes, the system has issues during early boot: https://bugzilla.kernel.org/show_bug.cgi?id=196683#c347

I do not know the cause of this, though, or whether it might be related to the page-cache.  I would hope that AMD's examination of the affected CPU will shed some light on this.

Comment 16 Zdenek Kabelac 2018-06-09 19:43:43 UTC
Well, the attached metadata revealed how the 'real user data' 'landed' in the metadata part of the device.

Originally, the cache's metadata was placed at the front of nvme0n1 (first 8 extents).
However, around Dec 21 (seq ~88) there was an 'uncaching', and _pmspare shifted into the place of the 'original' metadata.
A new cache was then allocated, this time using extents 7+.
However, these extents had previously been used for the cache data volume.

So whatever was there before has made its way into the _cmeta volume content.

To avoid this leakage, the user may (ATM) set 'issue_discards = 1' in the 'devices' section of lvm.conf, so that ANY freed extents are automatically discarded. This 'typically' gives back a zeroed portion of the device upon next usage (but it also makes operations like vgcfgrestore pointless, as all the data on the SSD/NVMe device has simply been discarded already).
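
As a minimal sketch, the corresponding lvm.conf fragment would look like this (the option is named issue_discards and lives in the devices section; the rest of the file is left unchanged):

```
# /etc/lvm/lvm.conf (fragment)
devices {
    # Send discards to the PV when an LV stops using its space,
    # e.g. on lvremove or lvreduce.
    issue_discards = 1
}
```

Whether a discarded range actually reads back as zeros depends on the device, so this reduces the leakage but does not guarantee that it cannot happen.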


lvm2 could probably be enhanced to discard metadata devices prior to zeroing their metadata header.
However, since not all SSDs are known to really 'zero' discarded blocks, we might end up needing to zero the whole metadata device (up to 16GiB of data).

I am unsure which path will be selected, and it is probably worth filing this as a separate BZ, as it is in effect unrelated to the 'missing' cache metadata superblock content.


Getting back to the analysis of the metadata on disk: a suspicious operation is visible after seqno 131 (Apr 28  13:46).

It is followed by seqno=69    Feb 17 20:24 ???
offset: 0x7a00

which is followed by seqno=132    May 26  06:07

I may just assume it was a temporary issue in storing the metadata, where the
new alignment code forgot to 'zero' the data up to the new page-aligned address
(worth an extra check, as no old metadata should be left between newly written content).
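
The order of sequence numbers can be listed from the extracted first 1MiB of the PV with standard tools. This is only a sketch: it assumes the text metadata contains lines of the form "seqno = N", and the function name list_seqnos is made up.

```shell
#!/bin/sh
# Sketch: list the VG metadata sequence numbers found in a raw PV image,
# in on-disk order.
# ASSUMPTION: the text metadata holds lines of the form "seqno = N".
list_seqnos() {
    # Turn every non-printable byte (including tabs and newlines) into a
    # newline, then keep only the number from each "seqno = N" line.
    LC_ALL=C tr -c '[:print:]' '\n' < "$1" |
        sed -n 's/^ *seqno = \([0-9][0-9]*\) *$/\1/p'
}
```

Calling `list_seqnos /tmp/extracted` prints one number per line. A value that is lower than its predecessor, such as 69 following 131 here, marks a stale copy of older metadata left between newer writes.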

Other than that, the cache metadata is now placed at extent 12+, which dates from Mar 6 21:11, seqno==100.

So we may assume these 12 extents (48MiB) used by _cmeta have been there and have not moved since the caching moment
Apr 28  14:17, seqno==130.

But for the reasons mentioned above, their 'unwritten' areas may contain pieces of old _cdata volume content.
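
To see which parts of a _cmeta copy were never written, or were zeroed by a discard, individual blocks can be tested for all-zero content. A minimal sketch, assuming 4096-byte blocks and a made-up helper name:

```shell
#!/bin/sh
# Sketch: test whether one block of an image file contains only zero bytes.
# ASSUMPTION: 4096-byte blocks. A block beyond the end of the file yields no
# data and is therefore also reported as zero.
BLOCK_SIZE=4096

# Usage: block_is_zero IMAGE BLOCK_INDEX
# Exit status 0 if the block is all zeros, non-zero otherwise.
block_is_zero() {
    nonzero=$(dd if="$1" bs="$BLOCK_SIZE" skip="$2" count=1 2>/dev/null |
        tr -d '\000' | wc -c | tr -d ' ')
    [ "$nonzero" -eq 0 ]
}
```

For example, `block_is_zero cache_cmeta.raw 1194 && echo zero` prints nothing if that block holds leftover data (the file name is only an example). Blocks that are neither zero nor part of the btree layout are candidates for old _cdata content.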

Comment 17 devurandom 2018-06-12 05:57:54 UTC
(In reply to Zdenek Kabelac from comment #16)

A brief history of the device:

In mid-February I received the new machine, including the new SSD (Samsung 960 EVO M.2 NVMe 250G).  I uncached the volume, removed the old SSD and added the new SSD as cache [1].  That is why there might be several changes to the LVs at around that time.  Further, the new machine was unstable as described above and I restored it from backups a few times from then on.  Around mid-April the new firmware arrived, which offered an option that made the system run stably after boot.

[1]: The creation of new caches always takes me several iterations of creating and deleting LVs, because the support for "I have a PV of size X, allocate the LVs necessary for the cache on the same PV accordingly" seems to be non-existent in the tooling.  Instead I had to tell LVM that I wanted a cache LV of size Y and adjust that size until the metadata volumes necessary to support it would also fit on the device.


In comment #4 you mentioned cache tool enhancements that were necessary.  Might any of those allow me to restore the contents of my cache and do you have a rough ETA for the first version of the patches?  Or is it safe to assume that the current cache is forever lost and I should just try to restore the contents of the data volume directly instead?

Comment 18 devurandom 2018-06-14 02:23:43 UTC
I assume there is no way to restore the cache (in comment #17 I was still hoping there would be a way  -- should you need someone to test new functionality, I will keep the backups for a while).

Hence, I just tried to uncache the LV forcefully or to find another way to access "system_corig" in read-write mode, in order to regain access to its contents.  Is there any way to do that using LVM tooling?  Or do I have to back up system_corig in read-only mode, delete the VG, recreate it and its PVs and LVs, and restore system_corig from the backup?  The latter is the only way I have found so far, but the former would be much faster (if it were supported by the tools), since it would just be setting some flags on the LVs.  Is this the missing tooling you mentioned in comment #4?

