Bug 1773421 - [RFE] Need a recovery tool in VDO for devices with corrupted superblocks or wiped-out dmvdo001 labels
Summary: [RFE] Need a recovery tool in VDO for devices with superblocks corrupted or w...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: vdo
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 8.0
Assignee: corwin
QA Contact: Filip Suba
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-18 06:28 UTC by nikhil kshirsagar
Modified: 2023-10-06 18:47 UTC
CC List: 6 users

Fixed In Version: 6.2.3.91
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-04 02:01:16 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID                              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata RHBA-2020:4551  0        None      None    None     2020-11-04 02:01:46 UTC

Internal Links: 1969213

Description nikhil kshirsagar 2019-11-18 06:28:01 UTC
Description of problem:
Several customers have reported issues with VDO devices where, for various reasons, the VDO device is missing its label (dmvdo001).

We need the equivalent of a vgcfgrestore at the vdo level in order to recover such devices, assuming the data is not affected.

For example, sdb1 is a VDO device which, after a reboot, comes up with the label missing. It is unclear how this could have happened, particularly after a clean reboot, but our best guess is that some third party such as Oracle ASM has claimed the device for itself and wiped the VDO metadata.


[]# hexdump -C -n512 /dev/sdb1
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|



Additional info:

In LVM we usually run pvcreate with the proper UUID and --restorefile (the UUID comes from the metadata backup stored in /etc/lvm/backup), and then a vgcfgrestore usually gets us back to a point where we can begin using the device again with all data intact. We need something equivalent in VDO.
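
For reference, a rough sketch of the LVM flow we want an equivalent of; the VG name and PV UUID below are placeholders taken from the metadata backup under /etc/lvm/backup:

# recreate the PV label on the wiped device, reusing the UUID recorded in the backup
pvcreate --uuid "<PV-UUID>" --restorefile /etc/lvm/backup/<vgname> /dev/sdb1
# restore the volume group metadata from the same backup and reactivate
vgcfgrestore <vgname>
vgchange -ay <vgname>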

More details:


o VDO Status: Bad magic number

## Error message:

$ vdo start --all
Starting VDO graylog
VDO instance 0 volume is ready at /dev/mapper/graylog
Starting VDO elastic
vdo: ERROR - Could not set up device mapper for elastic
vdo: ERROR - vdodumpconfig: allocateVDO failed for '/dev/disk/by-partuuid/98711647-e8f1-4e57-bcbd-9bb09e5b9e1d' with VDO Status: Bad magic number

## Findings:

o 2 VDO Volumes
  - elastic  (problematic)
  - graylog  (healthy)

o sdb1 (unhealthy)

  00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
  *
  00001000  31 30 6e 67 52 62 6c 41  d7 5c 0a 00 00 00 00 00  |10ngRblA.\......|  <== 0x00001000
  00001010  01 00 01 00 07 00 68 00  01 00 00 00 00 00 00 00  |......h.........|
  00001020  01 00 00 00 00 00 00 00  00 00 00 00 01 00 ff ff  |................|

  <cut>

o sdc1 (healthy)

  00000000  64 6d 76 64 6f 30 30 31  05 00 00 00 04 00 00 00  |dmvdo001........|
  00000010  00 00 00 00 5d 00 00 00  00 00 00 00 09 01 02 00  |....]...........|
  00000020  22 06 38 10 0f 97 05 00  5e 7f c1 9d 18 ef 44 19  |".8.....^.....D.|
  00000030  a9 3e 7d c6 1a f7 e8 79  00 00 00 00 01 00 00 00  |.>}....y........|
  00000040  00 00 00 00 01 00 00 00  d8 5c 0a 00 00 00 00 00  |.........\......|
  00000050  00 ff ff ff 00 00 00 00  00 ce be 76 b5 00 00 00  |...........v....|
  00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
  *
  00001000  31 30 6e 67 52 62 6c 41  d7 5c 0a 00 00 00 00 00  |10ngRblA.\......|
  00001010  01 00 01 00 07 00 68 00  01 00 00 00 00 00 00 00  |......h.........|
  00001020  01 00 00 00 00 00 00 00  00 00 00 00 01 00 ff ff  |................|

  <cut>

We can see that the start of the disk, up to offset 0x00001000, has been zeroed out (corrupted) on the problematic disk (/dev/sdb1).
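
To illustrate the kind of backup/restore workflow this RFE asks for (this is not an existing VDO tool; the 4 KiB size is only an assumption based on the dumps above, and the exact size of the region VDO would need to preserve is for engineering to confirm):

# while the volume is healthy, back up the leading metadata region (up to offset 0x1000 above)
dd if=/dev/sdb1 of=/root/vdo-sdb1-header.bin bs=4096 count=1
# if only that region is later wiped, write it back; only safe when the rest of the volume is known to be intact
dd if=/root/vdo-sdb1-header.bin of=/dev/sdb1 bs=4096 count=1 conv=notrunc,fsync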


This is also discussed in https://github.com/dm-vdo/vdo/issues/22 by the same customer.

I discussed this with Andy Walsh; the conversation is pasted below.

-----------------------------------


<awalsh|away> nkshirsa: Sorry, I meant to respond to you on Friday, but I lost track.  There is no way to recover from that situation.  One of the main reasons is that if this is unexpected, you have no way of knowing the extent of the damage.  If the user accidentally wiped the superblock from the device with something like wipefs or dd, or some way that you can be confident that the rest of the volume is intact, you can re-write that superblock.
<nkshirsa> awalsh|away, thanks .. 
<nkshirsa> awalsh|away, i think they've done a wipefs
<awalsh|away> nkshirsa: I will reply to the email thread in the morning.
<nkshirsa> awalsh|away, yes i know shivam has started a mail thread
<nkshirsa> i shall let him know 
<awalsh|away> nkshirsa: If they can be sure that's all they did, then a dd of the geometry block would get the volume back.  But if they did more than that, then the damage can be unrecoverable.
<nkshirsa> awalsh|away, but they don't have a similar sized vdo device to get the dd back from ... can we re-create the superblock from the code?
<awalsh|away> Possibly.  We don't have a process for it.  It might just be a matter of re-generating the 'dmvdo001' signature and that's it.  but I can't say for certain.
<nkshirsa> so we need the equivalent of a pvcreate --restorefile ? 
<awalsh|away> maybe?  I don't know that command.
<nkshirsa> something to stamp the label back
<nkshirsa> like lvm has labelone

<awalsh|away> Yeah, if that's all that was removed by the wipefs, then simply rewriting that could do the trick.
<nkshirsa> awalsh|away, so how to rewrite?
<awalsh|away> very narrow use case, but could work.
<awalsh|away> atm, I dunno.  If just the superblock, then you might be able to take it off of a healthy volume.  But if it's more than just the signature, then we'd have to be able to reconstruct the config first.
<nkshirsa> so you mean dd in the first 512 bytes of another similar vdo device ?
<nkshirsa> into the corrupt device ?
<awalsh|away> That's my thought.  but I'm not very familiar with that code.  So it may or may not actually work.
<nkshirsa> awalsh|away, this is the situation .. 
<nkshirsa> http://pastebin.test.redhat.com/814675
<awalsh|away> Yeah, I read through that in the email.
<nkshirsa> can we copy first 512 bytes from sdc1 and dd them into sdb1 ? 
<awalsh|away> I'm not certain.  Looking at that dump, my initial thought is no.  It looks like there's more data than just the signature in that block.  likely the configuration of the volume is in there.
<nkshirsa> awalsh|away, so is there anything we can do to help recover this customers data ?
<nkshirsa> at this point ?
<awalsh|away> Do they know what their create parameters were?
<nkshirsa> we will check with them 
<awalsh|away> I just tried a wipefs on a VDO volume.
<awalsh|away> It only wipes the dmvdo001 part.
<awalsh|away> So I'm not sure they did a wipefs.
<awalsh|away> I want to talk to the team a bit more about this tomorrow.  But I was able to wipefs one volume, and then dd the signature back onto that volume from another and start it.  This is really not a good place to be in, so we need to be very careful on how much we help in these situations.
<awalsh|away> We should be very explicit that only if the customer knows exactly what happened that this is a thing that can be tried.  Otherwise the volume is suspect and shouldn't be trusted for data integrity.
<nkshirsa> this customer tells us that it was a clean shutdown after adding 2 vcpu cores
<nkshirsa> but that makes no sense.. something has overwritten this vdo device beginning.. we're not sure what.. 
<awalsh|away> yeah.  So if we restore the geometry block, who is to say that the data region of VDO isn't also damaged beyond repair?
<nkshirsa> hmm so i guess there's nothing we can do here then 
<nkshirsa> i wish there was a utility that would take a backup of a vdo device , the part before the data begins, so we can restore from it in case we run into this situation again. something like a vgcfgrestore
<awalsh|away> My answer is probably not.  If they're desperate, then we can try to dd the superblock back on there, but I wouldn't trust it.
<nkshirsa> (here we're not even aware of the uuid.. ) 
<nkshirsa> since its not just the label thats gone
<awalsh|away> Yeah, I suppose.  But the use case is pretty limited.
<nkshirsa> honestly, because we have customers often having luns mapped incorrectly to asm, or other third party applications, after reboots, those luns get initialized by those third parties, and thus we lose metadata.. usually happens upon a reboot.
<nkshirsa> so its important to have a recovery mechanism in vdo for such cases
<nkshirsa> oracle often ends up overwriting lvm metadata with partitions and we have to vgcfgrestore them.. 
<awalsh|away> In the least-damaging case, they only wiped the superblock and we restore it.  Another situation could be that they wiped into the UDS index, which invalidates all their dedupe advice.  Not fatal, but not optimal.  If they wiped all the way into the data region, then their volume is unusable.
<nkshirsa> my guess is, as more customers begin to use vdo, this will be a common occurrence
<awalsh|away> Ok
<awalsh|away> I think the question from there is how much of the VDO blocks should we back up?  Just the blocks starting from address 0, and nothing else?  That's probably as far as I'd be comfortable with.
<nkshirsa> then we have the usual "created a partition by mistake" on the device kind of use cases.. 
<awalsh|away> yeah
<awalsh|away> Can you file an RFE for that?
<nkshirsa> just the metadata .. until the start of the data.. 
<nkshirsa> ack, i will file it today afternoon
<awalsh|away> Great.  We can clarify any questions in the ticket at that point.
<nkshirsa> cool
<nkshirsa> thanks !
---------------------------
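
For completeness, a rough sketch of the experiment Andy describes above. This is only worth attempting when it is certain that nothing past the dmvdo001 geometry block was touched and that the donor volume was created with identical parameters; the device names, VDO volume name, and the 4 KiB copy size are assumptions, and data integrity would still need to be verified afterwards.

# wipe only the on-disk signature of a test VDO backing device (as in Andy's test)
wipefs --all /dev/<wiped_backing_dev>
# copy the geometry block back from a backing device created with the same parameters
dd if=/dev/<donor_backing_dev> of=/dev/<wiped_backing_dev> bs=4096 count=1 conv=notrunc
# attempt to start the volume again
vdo start --name=<vdo_name>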

Comment 8 Filip Suba 2020-09-09 08:57:52 UTC
Verified with vdo-support-6.2.3.114-14.el8.

Comment 11 errata-xmlrpc 2020-11-04 02:01:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (kmod-kvdo bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4551

