Bug 1384321
Summary: | Allow using cold merge to recover a failed live/cold merge |
---|---|
Product: | Red Hat Enterprise Virtualization Manager |
Component: | ovirt-engine |
Status: | CLOSED ERRATA |
Severity: | high |
Priority: | high |
Version: | 4.0.3 |
Target Milestone: | ovirt-4.2.0 |
Target Release: | --- |
Hardware: | x86_64 |
OS: | Linux |
Reporter: | Germano Veit Michel <gveitmic> |
Assignee: | Ala Hino <ahino> |
QA Contact: | Kevin Alon Goldblatt <kgoldbla> |
Docs Contact: | |
CC: | apinnick, eshames, gwatson, lsurette, ratamir, rbalakri, Rhev-m-bugs, srevivo, tnisan, ykaul, ylavi |
Whiteboard: | |
Fixed In Version: | |
Doc Type: | Bug Fix |
Doc Text: | Previously, if a virtual machine was shut down during a live merge, an illegal snapshot disk was created, the live merge failed, and the virtual machine did not start up. In the current release, the virtual machine can be recovered with a cold merge. |
Story Points: | --- |
Clone Of: | |
Environment: | |
Last Closed: | 2018-05-15 17:38:43 UTC |
Type: | Bug |
Regression: | --- |
Mount Type: | --- |
Documentation: | --- |
CRM: | |
Verified Versions: | |
Category: | --- |
oVirt Team: | Storage |
RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- |
Target Upstream Version: | |
Embargoed: | |
Bug Depends On: | 1383301, 1496399 |
Bug Blocks: | |
Attachments: | |
Description
Germano Veit Michel
2016-10-13 06:23:06 UTC
Sorry, forgot this:

Version-Release number of selected component (if applicable):
ovirt-engine-4.0.4.4-0.1.el7ev.noarch
vdsm-4.18.11-1.el7ev.x86_64

We can't block detaching if a merge failed. A cleaner approach would be to allow the user to recover (i.e. retry) the failed merge after the attach.

After the BZs that this bug depends on, I'd expect recovery from any failure caused by live merge, either by retrying the live merge or by cold merge. I'd like to list again the live merge steps that are performed by the engine, and describe how recovery should work:

Extend
Merge
-----
Merge Status
Delete Volume
Verify Volume Deletion
-----
Reduce Base Volume

If live merge fails at the Extend or Merge step, the user should be able to retry the live merge or, if the VM has shut down, cold merge should recover from the live merge failure. Here, Merge could fail for different reasons, for example a job aborted in libvirt.

If live merge fails at Merge Status, Delete Volume or Verify Volume Deletion, the engine will keep retrying these commands until they succeed. This is reasonable behavior, because a failure in these steps could be due to temporary network exceptions or a host that went non-responsive.

Any failure at the Reduce Base Volume step should not fail the live merge operation.

Raz,

This bug should be resolved now. It is a RHV bug, so it is up to you whether to wait for the RHV bug in order to verify.

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:
[No relevant external trackers attached]
For more info please contact: rhv-devops

Verified with the following code:
-----------------------------------
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-53.gitc3edfc0.el7.centos.x86_64

Steps to Reproduce:
1. Start Live Removal of Snapshot S1
2. Reset Host, killing the VM
3. Snapshot Removal Fails, one image in chain is illegal
4. Detach SD
5. Attach SD
6. Import VM
7. Chain is now seen in the Engine as healthy (no illegal images)
8. VM starts successfully
9. Same snapshot fails to delete

us': u'Active', 'diskfree': '15569256448', 'isoprefix': '', 'alerts': [], 'disktotal': '21072183296', 'version': 4}}} from=::ffff:10.35.161.8,57268, task_id=6cf396fc-9988-4c4b-970e-446165613e10 (api:52)
2017-11-29 22:13:12,267+0200 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StoragePool.getInfo succeeded in 0.27 seconds (__init__:573)
2017-11-29 22:13:12,331+0200 ERROR (jsonrpc/3) [virt.vm] (vmId='8bcc5076-1fd9-4f7b-91a8-27c947513ffd') Live merge failed (job: 6269d03b-2ecf-4417-86d0-442839d21a19) (vm:5646)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5644, in merge
    bandwidth, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 126, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 512, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 678, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-commit': Could not reopen file: Permission denied
2017-11-29 22:13:12,375+0200 INFO (jsonrpc/3) [api.virt] FINISH merge return={'status': {'message': 'Merge failed', 'code': 52}} from=::ffff:10.35.161.8,57262, flow_id=cc44bac5-068b-4778-9112-d42997850b70 (api: 52)

Moving to ASSIGN

Created attachment 1360540 [details]
vdsm, server, engine logs, supervdsm logs
Adding logs
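For anyone triaging similar vdsm logs, a minimal sketch of pulling the VM ID and merge job ID out of a "Live merge failed" line may help. The regular expression is modelled only on the single ERROR line quoted in this report, so other vdsm versions may format the line differently; the names `LIVE_MERGE_FAILED`, `find_failed_merges` and `SAMPLE` are made up for this example and are not part of vdsm.

```python
import re

# Pattern modelled on the one "Live merge failed" line quoted in this bug.
# Assumption: other vdsm versions may lay the line out differently.
LIVE_MERGE_FAILED = re.compile(
    r"(?P<timestamp>\S+ \S+) ERROR \((?P<thread>[^)]+)\) \[virt\.vm\] "
    r"\(vmId='(?P<vm_id>[0-9a-f-]+)'\) Live merge failed "
    r"\(job: (?P<job_id>[0-9a-f-]+)\)"
)


def find_failed_merges(lines):
    """Return one dict (timestamp, thread, vm_id, job_id) per failed merge line."""
    failures = []
    for line in lines:
        match = LIVE_MERGE_FAILED.search(line)
        if match:
            failures.append(match.groupdict())
    return failures


# Two lines copied from the log excerpt in this report.
SAMPLE = [
    "2017-11-29 22:13:12,267+0200 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] "
    "RPC call StoragePool.getInfo succeeded in 0.27 seconds (__init__:573)",
    "2017-11-29 22:13:12,331+0200 ERROR (jsonrpc/3) [virt.vm] "
    "(vmId='8bcc5076-1fd9-4f7b-91a8-27c947513ffd') Live merge failed "
    "(job: 6269d03b-2ecf-4417-86d0-442839d21a19) (vm:5646)",
]

if __name__ == "__main__":
    for failure in find_failed_merges(SAMPLE):
        print(failure["vm_id"], failure["job_id"])
```

The job ID found this way can then be matched against the engine log for the same merge flow.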
Not sure what you did, but you are hitting a live merge bug: BZ 1509675. If this is a recovery using cold merge, I'm not sure I understand how you are hitting a live merge issue. Based on the steps you provided, after step #7 you should merge the snapshot while the VM is down. If you start the VM again, this is live merge and not a recovery using cold merge.

Moving back to ON_QA. Please try to understand the original steps provided. Again, the most important thing is to make sure it is possible to run the VM after live merge fails, i.e. no volumes are in an illegal state on the storage.

See the steps in the description:
1. Start Live Removal of Snapshot S1
2. Reset Host, killing the VM
3. Snapshot Removal Fails, one image in chain is illegal
4. Detach SD
5. Attach SD
6. Import VM
7. Chain is now seen in the Engine as healthy (no illegal images)
8. VM fails to start (illegal image in chain)
9. Same snapshot fails to remove (retry flow)

So before the fix, after live merge fails, the VM is down and cannot be started. Now in your case, you were able to start the VM, which is great; however, the VM is up now and you cannot test recovery using cold merge. That's why I suggested that you try cold merge, to see if it helps recover from the live merge that failed.

Removing the needinfo

Verified with the following code:
-----------------------------------
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-53.gitc3edfc0.el7.centos.x86_64

Ran the scenario as per comment 12:
-----------------------------------
1. Create a VM
2. Create snapshot(s)
3. Write data
4. Merge the active layer
5. Shut down the VM during the merge
6. Merge the same snapshot (while the VM is DOWN) - works fine

The live merge issue is covered by BZ 1509675.

Moving to verified!
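The per-step recovery rules described earlier in this report (Extend/Merge, the three retried steps, and Reduce Base Volume) can be summarised as a small lookup. This is only an illustrative sketch of the expected behavior as stated in the comments, not the engine's actual code; `Step`, `Recovery` and `recovery_for` are hypothetical names invented for this example.

```python
from enum import Enum


class Step(Enum):
    """Live merge steps as listed in this report (names are illustrative)."""
    EXTEND = "Extend"
    MERGE = "Merge"
    MERGE_STATUS = "Merge Status"
    DELETE_VOLUME = "Delete Volume"
    VERIFY_VOLUME_DELETION = "Verify Volume Deletion"
    REDUCE_BASE_VOLUME = "Reduce Base Volume"


class Recovery(Enum):
    """Expected reaction to a failure at a given step."""
    RETRY_LIVE_MERGE = "user retries live merge"
    COLD_MERGE = "user runs cold merge while the VM is down"
    ENGINE_RETRIES = "engine keeps retrying the command"
    IGNORE = "failure does not fail the live merge"


def recovery_for(step, vm_is_up):
    """Map a failed step to the recovery described in the comments above."""
    if step in (Step.EXTEND, Step.MERGE):
        # The user retries: live merge if the VM still runs, cold merge if not.
        return Recovery.RETRY_LIVE_MERGE if vm_is_up else Recovery.COLD_MERGE
    if step is Step.REDUCE_BASE_VOLUME:
        # Reducing the base volume is best effort.
        return Recovery.IGNORE
    # Merge Status, Delete Volume, Verify Volume Deletion.
    return Recovery.ENGINE_RETRIES


if __name__ == "__main__":
    for step in Step:
        print(step.value, "->", recovery_for(step, vm_is_up=False).value)
```

The verified scenario (VM shut down during the merge, then the same snapshot merged while the VM is down) corresponds to `recovery_for(Step.MERGE, vm_is_up=False)`.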
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:
[No relevant external trackers attached]
For more info please contact: rhv-devops

Added a gerrit patch to appease our bot overlords.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2018:1488

BZ<2>Jira Resync