Description of problem:

We cannot allow a storage domain (SD) to be detached while there are image operations that still need a recovery flow. Several snapshot-removal flows, when interrupted, can leave illegal images in the SD. There are also cases where the generated OVFs may not reflect the current chain in the SD, so the VM import may fail looking for images that don't exist or for the wrong chain, or even import fine using images that were already merged, are outdated, and should be deleted. There are dozens of ways a snapshot removal can go wrong and need to be retried.

The OVF does NOT carry all the image bits from the DB. It only carries the chain and does not contain ANY information about whether a snapshot removal needs to be retried and/or go through a recovery flow (i.e., no illegal-image info). A sketch of the kind of per-volume legality check that is lost across detach/attach appears at the end of this report.

Version-Release number of selected component (if applicable):

How reproducible:
100% - several ways

Steps to Reproduce:
1. Start live removal of snapshot S1
2. Reset the host, killing the VM
3. Snapshot removal fails, one image in the chain is illegal
4. Detach the SD
5. Attach the SD
6. Import the VM
7. The chain is now seen in the engine as healthy (no illegal images)
8. The VM fails to start (illegal image in the chain)
9. The same snapshot fails to be removed (retry flow)

Logs:
2016-10-13 01:57:06,717 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [3cd2bd32] Correlation ID: 3cd2bd32, Job ID: 8ca792cb-c249-4241-ab02-fbf7d4036760, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'S1' for VM 'CentOS7'.
2016-10-13 02:01:40,719 INFO [org.ovirt.engine.core.utils.archivers.tar.TarInMemoryExport] (org.ovirt.thread.pool-6-thread-43) [269a8b50] Retrieve OVF Entity from storage domain ID 'bfed9c9a-30c3-45d9-afaf-9e695ad9e2f5' for entity ID '4f1d68ed-05a2-493c-a849-45f029ffb3b7', entity name 'CentOS7' and VM Type of 'VM'
2016-10-13 02:01:41,053 INFO [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-43) [560fe0a5] Adding OVF data of entity id '4f1d68ed-05a2-493c-a849-45f029ffb3b7' and entity name 'CentOS7'
Job ID: 82460d6d-1843-48f4-bb41-6023ac16d714, Call Stack: null, Custom Event ID: -1, Message: VM CentOS7 was started by admin@internal-authz (Host: RHVH-20).
2016-10-13 02:02:30,086 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-5) [] VM '4f1d68ed-05a2-493c-a849-45f029ffb3b7'(CentOS7) moved from 'WaitForLaunch' --> 'Down'
2016-10-13 02:02:30,122 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-5) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM CentOS7 is down with error. Exit message: Bad volume specification {u'index': 0, u'domainID': u'bfed9c9a-30c3-45d9-afaf-9e695ad9e2f5', 'reqsize': '0', u'format': u'cow', u'bootOrder': u'1', u'address': {u'function': u'0x0', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'slot': u'0x06'}, u'volumeID': u'6c1749d9-86f8-431a-bb0c-d8f138ab6dc4', 'apparentsize': '1073741824', u'imageID': u'6aa3112f-eb2f-46d1-a9a6-23a4123a8d8f', u'specParams': {}, u'readonly': u'false', u'iface': u'virtio', u'optional': u'false', u'deviceId': u'6aa3112f-eb2f-46d1-a9a6-23a4123a8d8f', 'truesize': '1073741824', u'poolID': u'00000001-0001-0001-0001-0000000002f3', u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'type': u'disk'}.
2016-10-13 02:02:32,164 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-14) [] add VM '4f1d68ed-05a2-493c-a849-45f029ffb3b7'(CentOS7) to rerun treatment
2016-10-13 02:02:32,260 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-47) [] Correlation ID: 469132a9, Job ID: 82460d6d-1843-48f4-bb41-6023ac16d714, Call Stack: null, Custom Event ID: -1, Message: Failed to run VM CentOS7 on Host RHVH-10.
2016-10-13 02:08:18,828 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-24) [7572b924] Correlation ID: 40bd694f, Job ID: 593b13cc-10f8-489e-bef6-d1def55cd7ea, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'S1' deletion for VM 'CentOS7' was initiated by admin@internal-authz.
2016-10-13 02:08:22,436 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-12) [40bd694f] Correlation ID: 40bd694f, Job ID: 593b13cc-10f8-489e-bef6-d1def55cd7ea, Call Stack: null, Custom Event ID: -1, Message: Due to partial snapshot removal, Snapshot 'S1' of VM 'CentOS7' now contains only the following disks: 'CentOS7_Disk1'.
2016-10-13 02:08:22,493 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-12) [] Correlation ID: 40bd694f, Job ID: 593b13cc-10f8-489e-bef6-d1def55cd7ea, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'S1' for VM 'CentOS7'.

And try again:
2016-10-13 02:19:43,407 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-24) [3043e221] Correlation ID: 7748262d, Job ID: 9a0d2980-42c2-4306-9d7c-f3a9c1280fc5, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'S1' deletion for VM 'CentOS7' was initiated by admin@internal-authz.
2016-10-13 02:19:44,788 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-7) [7748262d] Correlation ID: 7748262d, Job ID: 9a0d2980-42c2-4306-9d7c-f3a9c1280fc5, Call Stack: null, Custom Event ID: -1, Message: Due to partial snapshot removal, Snapshot 'S1' of VM 'CentOS7' now contains only the following disks: 'CentOS7_Disk1'.
2016-10-13 02:19:44,851 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-7) [] Correlation ID: 7748262d, Job ID: 9a0d2980-42c2-4306-9d7c-f3a9c1280fc5, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'S1' for VM 'CentOS7'.

Actual results:
The VM is unusable, with no obvious way to recover it via the GUI. It probably needs manual intervention to recover.

Expected results:
Do not allow the SD to be detached.

See also: BZ 1372163 - [RFE] Warn user about VMs that have pending snapshot removal retries
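For reference, here is a minimal sketch (not part of any oVirt component) of the kind of per-volume legality check that the OVF does not preserve. It assumes the vdsm-client tool (vdsm >= 4.20) is available on a host, the UUIDs are placeholders, and the 'legality' field matches vdsm's Volume.getInfo output:
-----------------------------------
import json
import subprocess

def volume_is_legal(sp_id, sd_id, img_id, vol_id):
    """Return True if vdsm reports the volume as LEGAL."""
    out = subprocess.check_output([
        "vdsm-client", "Volume", "getInfo",
        "storagepoolID=%s" % sp_id,
        "storagedomainID=%s" % sd_id,
        "imageID=%s" % img_id,
        "volumeID=%s" % vol_id,
    ])
    info = json.loads(out)
    # vdsm keeps a LEGALITY flag in the volume metadata; 'ILLEGAL' marks a
    # volume left behind by an interrupted merge. The OVF stores only the
    # chain, so this state is invisible to the engine after detach/import.
    return info.get("legality") == "LEGAL"
-----------------------------------
Since the OVF carries none of this state, any such check has to run against the storage itself, not against the imported engine metadata.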
Sorry, forgot this:

Version-Release number of selected component (if applicable):
ovirt-engine-4.0.4.4-0.1.el7ev.noarch
vdsm-4.18.11-1.el7ev.x86_64
We can't block detaching if a merge failed. A cleaner approach would be to allow the user to recover (=retry) the failed merge post-attach.
After the BZs that this bug depends on, I'd expect recovery from any live merge failure, either by retrying live merge or by cold merge. I'd like to list again the live merge steps that are performed by the engine, and describe how recovery should work:

Extend
Merge
-----
Merge Status
Delete Volume
Verify Volume Deletion
-----
Reduce Base Volume

If live merge fails at the Extend or Merge steps, the user should be able to retry live merge or, if the VM shut down, cold merge should recover the live merge failure. Here, Merge could fail for different reasons, for example a job aborted at libvirt, etc.

If live merge fails at Merge Status, Delete Volume or Verify Volume Deletion, the engine will keep retrying these commands until they succeed. This is reasonable behavior, because a failure in these steps could be due to temporary network exceptions or a host that went non-responsive.

Any failure at the Reduce Base Volume step should not fail the live merge operation. A sketch of this retry policy follows.
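A minimal sketch (in Python, for illustration only - the actual engine logic is Java) of the per-step recovery policy described above. The step names and the three buckets mirror this comment; the run_step callback and retry interval are placeholders:
-----------------------------------
import time

# Failures the user recovers from, by retrying live merge or by cold merge
# once the VM is down.
USER_RETRYABLE = {"Extend", "Merge"}
# Failures the engine keeps retrying on its own (transient network errors,
# a host that went non-responsive).
ENGINE_RETRIES = {"Merge Status", "Delete Volume", "Verify Volume Deletion"}
# Failures that must never fail the merge as a whole.
IGNORE_FAILURE = {"Reduce Base Volume"}

def on_step_failure(step, run_step):
    """Decide what happens when a live merge step fails."""
    if step in IGNORE_FAILURE:
        return "SUCCEEDED"          # log and move on
    if step in ENGINE_RETRIES:
        while not run_step(step):   # keep retrying until it succeeds
            time.sleep(10)
        return "SUCCEEDED"
    if step in USER_RETRYABLE:
        return "FAILED"             # surface to the user for retry/cold merge
    raise ValueError("unknown step: %s" % step)
-----------------------------------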
Raz, this bug should be resolved now. It is a RHV bug, so it is up to you whether to wait for the RHV bug before verifying.
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [No relevant external trackers attached] For more info please contact: rhv-devops
Verified with the following code:
-----------------------------------
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-53.gitc3edfc0.el7.centos.x86_64

Steps to Reproduce:
1. Start live removal of snapshot S1
2. Reset the host, killing the VM
3. Snapshot removal fails, one image in the chain is illegal
4. Detach the SD
5. Attach the SD
6. Import the VM
7. The chain is now seen in the engine as healthy (no illegal images)
8. The VM starts successfully
9. The same snapshot fails to be deleted:

...us': u'Active', 'diskfree': '15569256448', 'isoprefix': '', 'alerts': [], 'disktotal': '21072183296', 'version': 4}}} from=::ffff:10.35.161.8,57268, task_id=6cf396fc-9988-4c4b-970e-446165613e10 (api:52)
2017-11-29 22:13:12,267+0200 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StoragePool.getInfo succeeded in 0.27 seconds (__init__:573)
2017-11-29 22:13:12,331+0200 ERROR (jsonrpc/3) [virt.vm] (vmId='8bcc5076-1fd9-4f7b-91a8-27c947513ffd') Live merge failed (job: 6269d03b-2ecf-4417-86d0-442839d21a19) (vm:5646)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5644, in merge
    bandwidth, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 126, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 512, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 678, in blockCommit
    if ret == -1: raise libvirtError('virDomainBlockCommit() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-commit': Could not reopen file: Permission denied
2017-11-29 22:13:12,375+0200 INFO (jsonrpc/3) [api.virt] FINISH merge return={'status': {'message': 'Merge failed', 'code': 52}} from=::ffff:10.35.161.8,57262, flow_id=cc44bac5-068b-4778-9112-d42997850b70 (api:52)

Moving to ASSIGNED.
Created attachment 1360540 [details]
vdsm, server, engine logs, supervdsm logs

Adding logs.
Not sure what you did, but you are hitting a live merge bug - BZ 1509675. If this is a recovery using cold merge, I don't understand how you are hitting a live merge issue. Based on the steps you provided, after step #7 you should merge the snapshot while the VM is down. If you start the VM again, this is live merge, not a recovery using cold merge. Moving back to ON_QA.
Please try to understand the original steps provided. Again, the most important thing is to make sure it is possible to run the VM after live merge fails, i.e. no volumes are left in an illegal state on the storage. See the steps in the description:

1. Start live removal of snapshot S1
2. Reset the host, killing the VM
3. Snapshot removal fails, one image in the chain is illegal
4. Detach the SD
5. Attach the SD
6. Import the VM
7. The chain is now seen in the engine as healthy (no illegal images)
8. The VM fails to start (illegal image in the chain)
9. The same snapshot fails to be removed (retry flow)

So before the fix, after live merge fails, the VM is down and cannot be started. Now in your case, you were able to start the VM, which is great; however, the VM is up now and you cannot test recovery using cold merge. That's why I suggested that you try cold merge to see if it helps recover the live merge that failed.
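Once the VM is down after the re-attach and import, the removal retry (cold merge) can also be driven from the Python SDK. A minimal sketch, assuming ovirtsdk4 is installed and that the engine URL, credentials and the snapshot description 'S1' are placeholders:
-----------------------------------
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",  # placeholder
    username="admin@internal",
    password="password",                                # placeholder
    ca_file="ca.pem",
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search="name=CentOS7")[0]
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()
# Retry removal of the snapshot that failed to merge, while the VM is down.
for snap in snapshots_service.list():
    if snap.description == "S1":
        snapshots_service.snapshot_service(snap.id).remove()
connection.close()
-----------------------------------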
Removing the needinfo
Verified with the following code:
-----------------------------------
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-53.gitc3edfc0.el7.centos.x86_64

Ran the scenario as per comment 12:
-----------------------------------
1. Create a VM
2. Create snapshot(s)
3. Write data
4. Merge the active layer
5. Shut down the VM during the merge
6. Merge the same snapshot (while the VM is DOWN) - works fine

The live merge issue is covered by BZ 1509675.

Moving to verified!
Added a gerrit patch to appease our bot overlords.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1488