Description of problem:

vdsm fails with the following error when trying to add a VM lease:

~~~
ERROR (tasks/3) [storage.TaskManager.Task] (Task='11111111-1111-1111-1111-111111111111') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 2061, in create_lease
    dom.create_lease(lease.lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 1241, in create_lease
    with self._manifest.external_leases_volume() as vol:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 659, in external_leases_volume
    vol = xlease.LeasesVolume(backend)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 493, in __init__
    self._md = self._index.read_metadata()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 734, in read_metadata
    return IndexMetadata.fromebytes(data)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 308, in fromebytes
    raise InvalidMetadata("invalid magic: %s" % magic, data)
InvalidMetadata: Invalid index metadata (invalid magic: 0), the index must be formatted or rebuilt from storage: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...' [truncated; the rest of the dumped data is all \x00]
~~~

Version-Release number of selected component (if applicable):

How reproducible:
Always, with an unformatted xleases volume

Steps to Reproduce:
1. Try to add a lease to an HA VM

Actual results:
The operation fails with the error above.

Expected results:
The operation finishes with no errors.

Additional info:
The xleases volume seems to be unformatted/uninitialized. The SD domain format is V4 within the engine database, and the SD volume group tags contain MDT_VERSION=4. The vdsm logs were rotated and do not contain any information about the SD upgrade to the V4 format or about its creation. There is no way to reformat the xleases volume via the API or the vdsm-{tool,client} commands.
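A quick way to confirm that the index area is indeed zeroed is to dump the metadata block directly. This is a minimal sketch, assuming a block-based domain where the xleases volume is exposed as /dev/<sd-uuid>/xleases (placeholder path); the offset mirrors the dd reproducer given later in this bug:

~~~
# Read the 512-byte index metadata block (offset 2048 * 512 bytes into the
# xleases volume) and dump it. On a healthy domain the first bytes contain
# the index magic; on this domain they should all be zeroes.
dd if=/dev/<sd-uuid>/xleases bs=512 skip=2048 count=1 iflag=direct 2>/dev/null | hexdump -C
~~~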
(In reply to Miguel Martin from comment #0)

The failure to add a lease is expected, since the xleases volume index is invalid:

    InvalidMetadata: Invalid index metadata (invalid magic: 0), the index must be formatted or rebuilt from storage: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\...

According to Miguel, the entire volume is empty and was never used. It looks like a volume that was created but not formatted, or that was wiped by someone.

There are 2 issues:

1. Why was the xleases volume wiped? Is this a bug in the system, failing to format the volume and leaving the domain with an unformatted volume, or a user error?

2. There is no easy way to recover from this situation.

Miguel:
- Can you explain here what the workaround is, or link to a support article about this?
- Please attach the vdsm logs from when this domain was created until this error occurred.
Germano, please review the patches, and maybe get other people from the team to review them.

Let me explain the solution:

- https://gerrit.ovirt.org/c/69189 - implement rebuilding the index from storage contents. We assume that the sanlock leases on storage are the truth.

- https://gerrit.ovirt.org/c/69190 - wire up Lease.rebuild_leases so it can be used from vdsm-client. This is a safe way to rebuild the leases, but it works only on the SPM, only if the domain is active, and only if vdsm is running.

- https://gerrit.ovirt.org/c/69193 - allow rebuilding on any host, even if vdsm is not running. This goes directly to storage and is like the "sanlock direct xxx" commands, or like writing to storage directly with dd. This is dangerous and requires that the storage domain is not active!

- https://gerrit.ovirt.org/c/69069 - allow formatting the index, dropping all entries. This is even more dangerous than rebuilding the index. However, it does not drop the sanlock leases on storage, so you can rebuild the index after formatting.

These patches give support a lot of power, but this also means you are responsible if you use these commands incorrectly.
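For reference, a sketch of the offline variants described in the last two patches, assuming they landed as the vdsm-tool verbs used later in this bug (rebuild-xleases and format-xleases); UUIDs and paths are placeholders, and the storage domain must not be active when these are run against storage directly:

~~~
# Rebuild the index from the sanlock leases on storage.
vdsm-tool rebuild-xleases <sd-uuid> /dev/<sd-uuid>/xleases

# Format the index, dropping all entries; the sanlock leases on storage are
# kept, so the index can be rebuilt afterwards.
vdsm-tool format-xleases <sd-uuid> /dev/<sd-uuid>/xleases
~~~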
*** Bug 1654419 has been marked as a duplicate of this bug. ***
Hi Nir!

(In reply to Nir Soffer from comment #2)
> Germano, please review the patches, and maybe get other people from the team
> to review them.

Done.

> These patches give support a lot of power, but this also means you are
> responsible if you use these commands incorrectly.

I can see it ;) Nice tool, will certainly be useful.

Thanks!
The steps to reproduce from the description are too generic. We have this scenario covered in our regression testing and we've never hit it. Miguel, can you please provide more specifics?
Hello Elad,

I am sorry, but there is no additional information about the upgrade or the creation of the affected storage domain, as the logs were rotated, so we don't know how this happened. The only thing we know is that the 'xleases' volume is/was zeroed.

A simple way to reproduce the mentioned exception is zeroing the 'xleases' volume with 'dd'.
Vojta, can you explain here how to corrupt the volume and how to test rebuilding and formatting the xleases volume?
Hi Elad,

as Miguel mentioned, the sequence of steps which resulted in this error is unfortunately still unknown. You can reproduce the issue with dd as follows; a condensed shell sketch of the whole flow follows at the end of this comment.

Reproducer:
* create an HA VM
  * check the "Highly Available" checkbox in the High Availability tab and select a storage domain for the lease in "Target Storage Domain for VM Lease". This is the storage you will corrupt with dd in the next step
  * you can check that the lease was created e.g. by running `vdsm-client -f lease Lease info`
* corrupt/zero the xleases volume with dd
  * dd if=/dev/zero bs=512 seek=2048 of=/dev/domain-uuid/xleases oflag=direct conv=fsync,notrunc count=1
* now creating another HA VM should fail
  * when you run `vdsm-client -f lease Lease info` you should see output very similar to the one described in this BZ (without the exception, just messages that the command failed and that the index must be rebuilt, and lots of \x00)

Testing the fix:
* rebuilding the index
  * run the reproducer
  * run `vdsm-tool rebuild-xleases domain-uuid /dev/domain-uuid/xleases` to rebuild the index
  * run `vdsm-client -f lease Lease info` to verify that the correct lease ID is back again
* formatting the index
  * run the reproducer
  * run `vdsm-tool format-xleases domain-uuid /dev/domain-uuid/xleases`
  * create a new HA VM; the VM should be created

To get some more info on how to use rebuild-xleases and format-xleases, you can check, besides the vdsm help, the appropriate commit messages:

rebuild-xleases: https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=7e275628751eee53d6f1c59dea9678e8cb2bec9f
format-xleases: https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=3acfdbdffbe9cdbf6d72ab3ba586e3141b56d334
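A condensed sketch of the reproducer and the rebuild test above, using the same placeholder paths (domain-uuid stands for the storage domain UUID; on a block domain the xleases volume is /dev/domain-uuid/xleases):

~~~
# 1. Zero the xleases index metadata block (the corruption step).
dd if=/dev/zero bs=512 seek=2048 of=/dev/domain-uuid/xleases \
    oflag=direct conv=fsync,notrunc count=1

# 2. Confirm the index is broken: this should now report that the index
#    must be formatted or rebuilt.
vdsm-client -f lease Lease info

# 3. Rebuild the index from the sanlock leases on storage.
vdsm-tool rebuild-xleases domain-uuid /dev/domain-uuid/xleases

# 4. Verify the lease ID is visible again.
vdsm-client -f lease Lease info
~~~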
One more update to the reproducer: 'lease' in `vdsm-client -f lease Lease info` means a json file which identifies the lease, and looks e.g. like this:

{
    "lease": {
        "sd_id": "e06def90-e01b-402e-bf1d-29760bfe71ce",
        "lease_id": "195d618a-a977-4c6c-b0db-dc762d8d2e23"
    }
}

where sd_id is the storage domain ID and lease_id is the ID of the lease. Both should be visible in the engine UI.
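A minimal usage sketch, assuming the json above is saved to a file named 'lease' in the current directory (UUIDs taken from the example above):

~~~
# Write the lease description file used by vdsm-client -f.
cat > lease <<'EOF'
{
    "lease": {
        "sd_id": "e06def90-e01b-402e-bf1d-29760bfe71ce",
        "lease_id": "195d618a-a977-4c6c-b0db-dc762d8d2e23"
    }
}
EOF

# Query the lease.
vdsm-client -f lease Lease info
~~~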
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
This was merged ages ago, and included in 4.3.0. Moving to modified.
1) Created a VM with a lease on an NFS domain.

[root@storage-ge3-vdsm1 ~]# vdsm-client -f lease Lease info
{
    "path": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge3__nfs__1/496ff642-52c9-4354-9a09-8e4d1378e068/dom_md/xleases",
    "lease_id": "f3a6c54b-f11a-4440-b37c-c6a20324cd12",
    "sd_id": "496ff642-52c9-4354-9a09-8e4d1378e068",
    "offset": 3145728
}

Where 'lease' is:
{
    "lease": {
        "sd_id": "496ff642-52c9-4354-9a09-8e4d1378e068",
        "lease_id": "f3a6c54b-f11a-4440-b37c-c6a20324cd12"
    }
}

2) Destroyed the 'xleases' volume:

[root@storage-ge3-vdsm1 ~]# dd if=/dev/zero bs=512 seek=2048 of=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge3__nfs__1/496ff642-52c9-4354-9a09-8e4d1378e068/dom_md/xleases oflag=direct conv=fsync,notrunc count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00193736 s, 264 kB/s

3) Created another VM with a lease on the same storage. This failed with the following:

2019-03-10 18:03:51,341+0200 DEBUG (tasks/0) [storage.xlease] Loading index from u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge3__nfs__1/496ff642-52c9-4354-9a09-8e4d1378e068/dom_md/xleases' (xlease:511)
2019-03-10 18:03:51,372+0200 ERROR (tasks/0) [storage.TaskManager.Task] (Task='08668583-385e-4261-9897-c33f57e147f7') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 2058, in create_lease
    dom.create_lease(lease.lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 1259, in create_lease
    with self._manifest.external_leases_volume() as vol:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 661, in external_leases_volume
    vol = xlease.LeasesVolume(backend)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 516, in __init__
    self._md = self._index.read_metadata()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 819, in read_metadata
    return IndexMetadata.fromebytes(data)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 331, in fromebytes
    raise InvalidMetadata("invalid magic: %s" % magic, data)
InvalidMetadata: Invalid index metadata (invalid magic: 0), the index must be formatted or rebuilt from storage: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0

4) Rebuilt the index:

[root@storage-ge3-vdsm1 ~]# vdsm-tool rebuild-xleases domain-uuid /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge3__nfs__1/496ff642-52c9-4354-9a09-8e4d1378e068/dom_md//xleases

5) Verified that the lease is valid again:

[root@storage-ge3-vdsm1 ~]# vdsm-client -f lease Lease info
{
    "path": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge3__nfs__1/496ff642-52c9-4354-9a09-8e4d1378e068/dom_md/xleases",
    "lease_id": "f3a6c54b-f11a-4440-b37c-c6a20324cd12",
    "sd_id": "domain-uuid",
    "offset": 3145728
}

6) Created a new VM with a lease on the same domain, succeeded:

[root@storage-ge3-vdsm1 ~]# vdsm-client -f lease Lease info
{
    "path": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge3__nfs__1/496ff642-52c9-4354-9a09-8e4d1378e068/dom_md/xleases",
    "lease_id": "0f28d3b0-46fd-457c-8b74-8883eb348b13",
    "sd_id": "domain-uuid",
    "offset": 4194304
}

==============================================
Used:
vdsm-4.30.9-1.el7ev.x86_64
rhvm-4.3.1.1-0.1.el7.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1077
sync2jira