Bug 1654417 - vdsm fails with Invalid index metadata (invalid magic: 0) when trying to add a VM lease
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.3.1
: 4.3.0
Assignee: Vojtech Juranek
QA Contact: Elad
URL:
Whiteboard:
: 1654419 (view as bug list)
Depends On:
Blocks: 1654419 1656815
 
Reported: 2018-11-28 18:15 UTC by Miguel Martin
Modified: 2022-03-13 16:35 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if an xlease volume was corrupted, VDSM could not acquire leases and features like high-availability virtual machines did not work. The current release adds rebuild-xleases and format-xleases commands to the VDSM tool. Administrators can use these commands to rebuild or format corrupted xlease volumes.
Clone Of:
: 1654419 1656815 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:36:02 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-45225 0 None Closed /usr/lib/tmpfiles.d configuration issue during update to 8.4 2022-05-17 05:56:48 UTC
Red Hat Knowledge Base (Solution) 3723131 0 None None None 2018-11-29 11:52:23 UTC
Red Hat Product Errata RHBA-2019:1077 0 None None None 2019-05-08 12:36:24 UTC
oVirt gerrit 69069 0 master MERGED vdsm-tool: Add format-xleases command 2020-10-08 14:31:19 UTC
oVirt gerrit 69189 0 master MERGED xleases: Implement rebuild_index 2020-10-08 14:31:19 UTC
oVirt gerrit 69190 0 master MERGED xleases: Wire the rebuild_leases API 2020-10-08 14:31:19 UTC
oVirt gerrit 69193 0 master MERGED vdsm-tool: Add rebuild-xleases command 2020-10-08 14:31:19 UTC
oVirt gerrit 95975 0 ovirt-4.2 MERGED xleases: Implement rebuild_index 2020-10-08 14:31:20 UTC
oVirt gerrit 95976 0 ovirt-4.2 MERGED xleases: Wire the rebuild_leases API 2020-10-08 14:31:20 UTC
oVirt gerrit 95977 0 ovirt-4.2 MERGED vdsm-tool: Add format-xleases command 2020-10-08 14:31:20 UTC
oVirt gerrit 95978 0 ovirt-4.2 MERGED vdsm-tool: Add rebuild-xleases command 2020-10-08 14:31:20 UTC

Description Miguel Martin 2018-11-28 18:15:32 UTC
Description of problem:

vdsm fails with the following error when trying to add a VM lease:
~~~
ERROR (tasks/3) [storage.TaskManager.Task] (Task='11111111-1111-1111-1111-111111111111') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 2061, in create_lease
    dom.create_lease(lease.lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 1241, in create_lease
    with self._manifest.external_leases_volume() as vol:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 659, in external_leases_volume
    vol = xlease.LeasesVolume(backend)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 493, in __init__
    self._md = self._index.read_metadata()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 734, in read_metadata
    return IndexMetadata.fromebytes(data)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 308, in fromebytes
    raise InvalidMetadata("invalid magic: %s" % magic, data)
InvalidMetadata: Invalid index metadata (invalid magic: 0), the index must be formatted or rebuilt from storage: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\...'
~~~


Version-Release number of selected component (if applicable):


How reproducible:
Always, with an unformatted xleases volume

Steps to Reproduce:
1. Try to add a lease to an HA VM


Actual results:
The operation fails with the error above

Expected results:
The operation finishes with no errors.


Additional info:

The xleases volume seems to be unformatted/uninitialized.
The storage domain format is V4 in the engine database.
The storage domain's volume group tags contain MDT_VERSION=4.

The vdsm logs were rotated and do not contain any information about the SD upgrade to V4 format or about its creation.

There is no way to reformat the xleases volume via the API or the vdsm-{tool,client} commands.

Comment 1 Nir Soffer 2018-11-28 18:40:43 UTC
(In reply to Miguel Martin from comment #0)
The failure to add a lease is expected; the xleases volume index is invalid:

InvalidMetadata: Invalid index metadata (invalid magic: 0), the index must be
formatted or rebuilt from storage:
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\...

According to Miguel, the entire volume is empty and was never used. It looks
like a volume that was created but not formatted, or wiped by someone.

There are 2 issues:

1. Why was the xleases volume wiped? Is this a bug in the system, failing to
   format the volume and leaving the domain with an unformatted volume, or a
   user error?

2. There is no easy way to recover from this situation.

Miguel:

- Can you explain here what the workaround is, or link to a support article
  about this?
- Please attach vdsm logs since this domain was created until this error.

Comment 2 Nir Soffer 2018-11-29 11:28:55 UTC
Germano, please review the patches, and maybe get other people from the team to
review them.

Let me explain the solution:

- https://gerrit.ovirt.org/c/69189 - implement rebuilding the index from storage
  contents. We assume that the sanlock leases on storage are the truth.

- https://gerrit.ovirt.org/c/69190 - wire up Lease.rebuild_leases so it can be
  used from vdsm-client. This is a safe way to rebuild the leases, but it works
  only on the SPM, only if the domain is active, and only if vdsm is running.

- https://gerrit.ovirt.org/c/69193 - allow rebuilding on any host, even if vdsm
  is not running. This goes directly to storage, like the "sanlock direct xxx"
  commands or writing to storage directly with dd. This is dangerous and
  requires that the storage domain is not active!

- https://gerrit.ovirt.org/c/69069 - allow formatting the index, dropping all
  entries. This is even more dangerous than rebuilding the index. However, it
  does not drop the sanlock leases on storage, so you can rebuild the index
  after formatting.

These patches give support a lot of power, but this also means you are
responsible if you use these commands incorrectly.
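Taken together, the four patches above map to one online and two offline recovery commands. A hedged sketch of how they might be invoked (command names come from the patch subjects; the vdsm-client verb spelling and exact argument forms here are assumptions):

```shell
# Online path (SPM host, domain active, vdsm running) - the safe option.
# Verb name taken from the patch subject; exact arguments are assumptions.
vdsm-client Lease rebuild_leases

# Offline paths (any host, but the storage domain must NOT be active).
# These write directly to storage, like "sanlock direct" or dd.
vdsm-tool rebuild-xleases domain-uuid /dev/domain-uuid/xleases

# Most dangerous: drops all index entries (the sanlock leases on storage
# are kept, so the index can be rebuilt afterwards).
vdsm-tool format-xleases domain-uuid /dev/domain-uuid/xleases
```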

Comment 6 Miguel Martin 2018-11-29 11:52:23 UTC
*** Bug 1654419 has been marked as a duplicate of this bug. ***

Comment 7 Germano Veit Michel 2018-11-30 06:38:00 UTC
Hi Nir!

(In reply to Nir Soffer from comment #2)
> Germano, please review the patches, and maybe get other people from the team
> to review them.

Done.

> These patches give support a lot of power, but this also means you are
> responsible if you use these commands incorrectly.

I can see it ;)

Nice tool, it will certainly be useful. Thanks!

Comment 11 Elad 2018-12-15 12:17:17 UTC
The steps to reproduce from the description are too generic. We have this scenario covered in our regression testing and we've never hit it. 
Miguel, can you please provide more specifics?

Comment 12 Miguel Martin 2018-12-17 10:18:47 UTC
Hello Elad,

I am sorry, but there is no additional information about the upgrade or the creation of the affected storage domain, as the logs were rotated, so we don't know how this happened.
The only thing we know is that the 'xleases' volume is/was zeroed. A simple way to reproduce the mentioned exception is zeroing the 'xleases' volume with 'dd'.

Comment 13 Nir Soffer 2018-12-17 13:15:19 UTC
Vojta, can you explain here how to corrupt the volume and how to test
rebuilding and formatting the xleases volume?

Comment 14 Vojtech Juranek 2018-12-18 09:16:40 UTC
Hi Elad,
as Miguel mentioned, the sequence of steps that resulted in this error is unfortunately still unknown. You can reproduce the issue with dd as follows.

Reproducer:
* create an HA VM
  * check the Highly Available checkbox in the High Availability tab and select a storage domain for the lease in "Target Storage Domain for VM Lease". This is the storage you will corrupt with dd in the next step
  * you can check that the lease was created, e.g. by running `vdsm-client -f lease Lease info`
* corrupt/zero the xleases volume with dd
  * dd if=/dev/zero bs=512 seek=2048 of=/dev/domain-uuid/xleases oflag=direct conv=fsync,notrunc count=1
  * now creating another HA VM should fail
  * when you run `vdsm-client -f lease Lease info` you should see output very similar to the one described in this BZ (without the exception, just messages that the command failed and that the index must be rebuilt, plus lots of \x00)
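The dd invocation above targets one specific block; a small sketch of the arithmetic (values taken straight from the command):

```python
# Where the reproducer's dd write lands: bs=512 with seek=2048 skips
# 2048 output blocks, so the single 512-byte write starts at the 1 MiB
# offset where the xlease index begins.
bs = 512       # dd: bytes per block
seek = 2048    # dd: blocks skipped on the output before writing
count = 1      # dd: blocks written

offset = bs * seek    # byte offset of the overwrite
length = bs * count   # bytes overwritten

print(offset, length)   # 1048576 512
```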

Testing the fix:
* rebuilding the index
  * run the reproducer
  * run `vdsm-tool rebuild-xleases domain-uuid /dev/domain-uuid/xleases` to rebuild the index
  * run `vdsm-client -f lease Lease info` to verify that the correct lease ID is back again
* formatting the index
  * run the reproducer
  * run `vdsm-tool format-xleases domain-uuid /dev/domain-uuid/xleases`
  * create a new HA VM; the VM should be created

For more info on how to use rebuild-xleases and format-xleases, you can check, besides the vdsm help, the appropriate commit messages:
rebuild-xleases: https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=7e275628751eee53d6f1c59dea9678e8cb2bec9f
format-xleases: https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=3acfdbdffbe9cdbf6d72ab3ba586e3141b56d334

Comment 15 Vojtech Juranek 2019-01-15 14:28:52 UTC
One more update to the reproducer:
'lease' in `vdsm-client -f lease Lease info` refers to a JSON file which identifies the lease and looks e.g. like this:

{ 
  "lease": {
           "sd_id": "e06def90-e01b-402e-bf1d-29760bfe71ce", 
           "lease_id": "195d618a-a977-4c6c-b0db-dc762d8d2e23"
           }
}

where sd_id is the storage domain ID and lease_id is the ID of the lease. Both should be visible in the engine UI.

Comment 16 Sandro Bonazzola 2019-01-28 09:44:12 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 18 Nir Soffer 2019-01-30 22:13:19 UTC
This was merged ages ago, and included in 4.3.0. Moving to modified.

Comment 20 Elad 2019-03-10 16:19:57 UTC
1) Created a VM with a lease on NFS domain.

[root@storage-ge3-vdsm1 ~]# vdsm-client -f lease Lease info
{
    "path": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge3__nfs__1/496ff642-52c9-4354-9a09-8e4d1378e068/dom_md/xleases", 
    "lease_id": "f3a6c54b-f11a-4440-b37c-c6a20324cd12", 
    "sd_id": "496ff642-52c9-4354-9a09-8e4d1378e068", 
    "offset": 3145728
}

Where 'lease' is:

{
  "lease": {
           "sd_id": "496ff642-52c9-4354-9a09-8e4d1378e068",
           "lease_id": "f3a6c54b-f11a-4440-b37c-c6a20324cd12"
           }
}


2) Destroyed 'xleases' volume:


[root@storage-ge3-vdsm1 ~]# dd if=/dev/zero bs=512 seek=2048 of=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge3__nfs__1/496ff642-52c9-4354-9a09-8e4d1378e068/dom_md/xleases oflag=direct conv=fsync,notrunc count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00193736 s, 264 kB/s


3) Created another VM with lease on the same storage. This failed with the following:

2019-03-10 18:03:51,341+0200 DEBUG (tasks/0) [storage.xlease] Loading index from u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge3__nfs__1/496ff642-52c9-4354-9a09-8e4d13
78e068/dom_md/xleases' (xlease:511)
2019-03-10 18:03:51,372+0200 ERROR (tasks/0) [storage.TaskManager.Task] (Task='08668583-385e-4261-9897-c33f57e147f7') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 2058, in create_lease
    dom.create_lease(lease.lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 1259, in create_lease
    with self._manifest.external_leases_volume() as vol:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 661, in external_leases_volume
    vol = xlease.LeasesVolume(backend)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 516, in __init__
    self._md = self._index.read_metadata()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 819, in read_metadata
    return IndexMetadata.fromebytes(data)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 331, in fromebytes
    raise InvalidMetadata("invalid magic: %s" % magic, data)
InvalidMetadata: Invalid index metadata (invalid magic: 0), the index must be formatted or rebuilt from storage: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0


4) Rebuilt the index:

[root@storage-ge3-vdsm1 ~]# vdsm-tool rebuild-xleases  domain-uuid /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge3__nfs__1/496ff642-52c9-4354-9a09-8e4d1378e068/dom_md//xleases


5) Verified that the lease is valid again:

[root@storage-ge3-vdsm1 ~]# vdsm-client -f lease Lease info
{
    "path": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge3__nfs__1/496ff642-52c9-4354-9a09-8e4d1378e068/dom_md/xleases", 
    "lease_id": "f3a6c54b-f11a-4440-b37c-c6a20324cd12", 
    "sd_id": "domain-uuid", 
    "offset": 3145728
}

6) Created a new VM with lease on the same domain, succeeded:

[root@storage-ge3-vdsm1 ~]# vdsm-client -f lease Lease info
{
    "path": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge3__nfs__1/496ff642-52c9-4354-9a09-8e4d1378e068/dom_md/xleases", 
    "lease_id": "0f28d3b0-46fd-457c-8b74-8883eb348b13", 
    "sd_id": "domain-uuid", 
    "offset": 4194304
}
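The two lease offsets reported in steps 1 and 6 differ by exactly 1 MiB, consistent with fixed-size lease slots; a sketch of that arithmetic (the slot size is inferred from these two observed offsets only, not from vdsm internals):

```python
# Sketch: the lease offsets observed in steps 1 and 6 above. Their
# difference suggests fixed 1 MiB lease slots; the slot size here is
# inferred from this BZ's output, not taken from vdsm source.
MiB = 1024 * 1024

first_lease = 3145728    # offset of the first lease, from step 1
second_lease = 4194304   # offset of the second lease, from step 6

assert first_lease == 3 * MiB
assert second_lease - first_lease == MiB   # one 1 MiB slot apart
print(second_lease - first_lease)          # 1048576
```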


==============================================
Used:
vdsm-4.30.9-1.el7ev.x86_64
rhvm-4.3.1.1-0.1.el7.noarch

Comment 23 errata-xmlrpc 2019-05-08 12:36:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077

Comment 24 Daniel Gur 2019-08-28 13:13:12 UTC
sync2jira

Comment 25 Daniel Gur 2019-08-28 13:17:25 UTC
sync2jira

