Bug 1384321 - Allow using cold merge to recover a failed live/cold merge
Summary: Allow using cold merge to recover a failed live/cold merge
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.0.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.2.0
: ---
Assignee: Ala Hino
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On: 1383301 1496399
Blocks:
 
Reported: 2016-10-13 06:23 UTC by Germano Veit Michel
Modified: 2020-01-17 16:01 UTC
11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if a virtual machine was shut down during a live merge, an illegal snapshot disk was created, the live merge failed, and the virtual machine did not start up. In the current release, the virtual machine can be recovered with a cold merge.
Clone Of:
Environment:
Last Closed: 2018-05-15 17:38:43 UTC
oVirt Team: Storage
Target Upstream Version:


Attachments (Terms of Use)
vdsm, server, engine logs, supervdsm logs (1.68 MB, application/x-gzip)
2017-11-29 20:56 UTC, Kevin Alon Goldblatt
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2018:1488 0 None None None 2018-05-15 17:40:57 UTC
oVirt gerrit 82528 0 'None' MERGED core: Retry failed live merge commands 2021-02-15 18:00:50 UTC

Description Germano Veit Michel 2016-10-13 06:23:06 UTC
Description of problem:

We cannot allow an SD to be detached while there are image operations that require a recovery flow.

Several snapshot removal flows, when interrupted, can leave illegal images in the SD. I believe there are also cases where the generated OVFs may not reflect the current chain in the SD, so the VM import may fail looking for images that don't exist (or for the wrong chain), or may even import fine using images that were already merged, are outdated, and should have been deleted. There are literally dozens of cases where snapshot removal can go wrong and needs to be retried.

The OVF does NOT carry all the image bits from the DB. It only carries the chain and contains NO information about whether a snapshot removal needs to be retried and/or go through a recovery flow (i.e. no legality info).
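The information loss described above can be sketched as follows. This is a hypothetical model, not the actual engine code: the field names (`id`, `parent`, `legality`) and the two helper functions are illustrative only, but they show why a detach/attach/import cycle erases the illegal-volume state the DB was tracking.

```python
# Hypothetical sketch of the information loss: the engine DB tracks per-volume
# legality, but the OVF written to the SD only records the chain, so
# re-importing the VM effectively resets every volume to LEGAL.

def export_ovf_chain(db_volumes):
    """Mimic OVF generation: keep only the chain (IDs/parents), drop legality."""
    return [{"id": v["id"], "parent": v["parent"]} for v in db_volumes]

def import_vm(ovf_chain):
    """Mimic VM import: with no legality info in the OVF, every volume
    is assumed LEGAL and any pending retry/recovery state is lost."""
    return [dict(v, legality="LEGAL") for v in ovf_chain]

db = [
    {"id": "base", "parent": None, "legality": "LEGAL"},
    {"id": "s1", "parent": "base", "legality": "ILLEGAL"},  # interrupted merge
]
imported = import_vm(export_ovf_chain(db))
# The imported chain now looks healthy, even though the volume on storage
# is still illegal -- which is exactly why the VM later fails to start.
```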

Version-Release number of selected component (if applicable):

How reproducible:
100% - several ways

Steps to Reproduce:
1. Start Live Removal of Snapshot S1
2. Reset Host, killing the VM
3. Snapshot Removal Fails, one image in chain is illegal
4. Detach SD
5. Attach SD
6. Import VM
7. Chain is now seen in the Engine as healthy (no illegal images)
8. VM fails to start (illegal image in chain)
9. Same snapshot fails to remove (retry flow)

Logs:

2016-10-13 01:57:06,717 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [3cd2bd32] Correlation ID: 3cd2bd32, Job ID: 8ca792cb-c249-4241-ab02-fbf7d4036760, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'S1' for VM 'CentOS7'.

2016-10-13 02:01:40,719 INFO  [org.ovirt.engine.core.utils.archivers.tar.TarInMemoryExport] (org.ovirt.thread.pool-6-thread-43) [269a8b50] Retrieve OVF Entity from storage domain ID 'bfed9c9a-30c3-45d9-afaf-9e695ad9e2f5' for entity ID '4f1d68ed-05a2-493c-a849-45f029ffb3b7', entity name 'CentOS7' and VM Type of 'VM'

2016-10-13 02:01:41,053 INFO  [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-43) [560fe0a5] Adding OVF data of entity id '4f1d68ed-05a2-493c-a849-45f029ffb3b7' and entity name 'CentOS7'

Job ID: 82460d6d-1843-48f4-bb41-6023ac16d714, Call Stack: null, Custom Event ID: -1, Message: VM CentOS7 was started by admin@internal-authz (Host: RHVH-20).
2016-10-13 02:02:30,086 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-5) [] VM '4f1d68ed-05a2-493c-a849-45f029ffb3b7'(CentOS7) moved from 'WaitForLaunch' --> 'Down'

2016-10-13 02:02:30,122 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-5) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM CentOS7 is down with error. Exit message: Bad volume specification {u'index': 0, u'domainID': u'bfed9c9a-30c3-45d9-afaf-9e695ad9e2f5', 'reqsize': '0', u'format': u'cow', u'bootOrder': u'1', u'address': {u'function': u'0x0', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'slot': u'0x06'}, u'volumeID': u'6c1749d9-86f8-431a-bb0c-d8f138ab6dc4', 'apparentsize': '1073741824', u'imageID': u'6aa3112f-eb2f-46d1-a9a6-23a4123a8d8f', u'specParams': {}, u'readonly': u'false', u'iface': u'virtio', u'optional': u'false', u'deviceId': u'6aa3112f-eb2f-46d1-a9a6-23a4123a8d8f', 'truesize': '1073741824', u'poolID': u'00000001-0001-0001-0001-0000000002f3', u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'type': u'disk'}.

2016-10-13 02:02:32,164 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-14) [] add VM '4f1d68ed-05a2-493c-a849-45f029ffb3b7'(CentOS7) to rerun treatment

2016-10-13 02:02:32,260 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-47) [] Correlation ID: 469132a9, Job ID: 82460d6d-1843-48f4-bb41-6023ac16d714, Call Stack: null, Custom Event ID: -1, Message: Failed to run VM CentOS7 on Host RHVH-10.

2016-10-13 02:08:18,828 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-24) [7572b924] Correlation ID: 40bd694f, Job ID: 593b13cc-10f8-489e-bef6-d1def55cd7ea, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'S1' deletion for VM 'CentOS7' was initiated by admin@internal-authz.

2016-10-13 02:08:22,436 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-12) [40bd694f] Correlation ID: 40bd694f, Job ID: 593b13cc-10f8-489e-bef6-d1def55cd7ea, Call Stack: null, Custom Event ID: -1, Message: Due to partial snapshot removal, Snapshot 'S1' of VM 'CentOS7' now contains only the following disks: 'CentOS7_Disk1'.

2016-10-13 02:08:22,493 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-12) [] Correlation ID: 40bd694f, Job ID: 593b13cc-10f8-489e-bef6-d1def55cd7ea, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'S1' for VM 'CentOS7'.

And try again:

2016-10-13 02:19:43,407 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-24) [3043e221] Correlation ID: 7748262d, Job ID: 9a0d2980-42c2-4306-9d7c-f3a9c1280fc5, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'S1' deletion for VM 'CentOS7' was initiated by admin@internal-authz.

2016-10-13 02:19:44,788 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-7) [7748262d] Correlation ID: 7748262d, Job ID: 9a0d2980-42c2-4306-9d7c-f3a9c1280fc5, Call Stack: null, Custom Event ID: -1, Message: Due to partial snapshot removal, Snapshot 'S1' of VM 'CentOS7' now contains only the following disks: 'CentOS7_Disk1'.

2016-10-13 02:19:44,851 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-7) [] Correlation ID: 7748262d, Job ID: 9a0d2980-42c2-4306-9d7c-f3a9c1280fc5, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'S1' for VM 'CentOS7'.

Actual results:
VM is unusable, with no obvious way to recover it via the GUI. It probably needs manual intervention to recover.

Expected results:
Do not allow the SD to be detached.

See also:
BZ 1372163 - [RFE] Warn user about VMs that have pending snapshot removal retries

Comment 1 Germano Veit Michel 2016-10-13 06:29:55 UTC
Sorry, forgot this:

Version-Release number of selected component (if applicable):
ovirt-engine-4.0.4.4-0.1.el7ev.noarch
vdsm-4.18.11-1.el7ev.x86_64

Comment 2 Allon Mureinik 2017-03-19 16:26:30 UTC
We can't block detaching if a merge failed.
A cleaner approach would be to allow the user to recover (=retry) the failed merge post-attach.

Comment 3 Ala Hino 2017-10-26 12:39:26 UTC
After the BZs that this bug depends on, I'd expect recovery from any live merge failure, either by retrying the live merge or by a cold merge.
I'd like to list again the live merge steps that are performed by the engine, and describe how recovery should work:

Extend
Merge
-----
Merge Status
Delete Volume
Verify Volume Deletion
-----
Reduce Base Volume

If live merge fails at the Extend or Merge steps, the user should be able to retry the live merge or, if the VM is shut down, a cold merge should recover the live merge failure.
Here, Merge could fail for different reasons, for example, a job aborted at libvirt, etc.

If live merge fails at Merge Status, Delete Volume or Verify Volume Deletion, the engine will keep retrying these commands until they succeed. This is reasonable behavior, because a failure at these steps could be due to temporary network exceptions or a host that went non-responsive.

Any failure at the Reduce Base Volume step should not fail the live merge operation.
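The recovery policy described above can be summarized as a small decision table. This is a minimal sketch, not the actual ovirt-engine code; the step names are taken from the list above, and the return strings are illustrative labels for the three behaviors described.

```python
# Hypothetical sketch of the recovery policy from the step list above:
# which live-merge steps are user-retryable, which the engine retries
# itself, and which are ignored on failure.

USER_RETRY = {"Extend", "Merge"}
ENGINE_RETRY = {"Merge Status", "Delete Volume", "Verify Volume Deletion"}
IGNORE = {"Reduce Base Volume"}  # a failure here must not fail the merge

def recovery_action(failed_step, vm_is_down=False):
    """Map a failed live-merge step to the expected recovery behavior."""
    if failed_step in USER_RETRY:
        # User retries live merge; if the VM went down, cold merge recovers it.
        return "cold merge" if vm_is_down else "retry live merge"
    if failed_step in ENGINE_RETRY:
        # Failures here may be transient (network, non-responsive host).
        return "engine keeps retrying"
    if failed_step in IGNORE:
        return "ignored"
    raise ValueError("unknown step: %s" % failed_step)
```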

Comment 4 Ala Hino 2017-10-29 12:56:40 UTC
Raz,

This bug should be resolved now.
It is an RHV bug, so it is up to you whether to wait for the RHV bug in order to verify.

Comment 5 rhev-integ 2017-11-02 13:39:58 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops@redhat.com

Comment 6 rhev-integ 2017-11-02 21:09:14 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops@redhat.com

Comment 7 Kevin Alon Goldblatt 2017-11-29 20:55:10 UTC
Verified with the following code:
-----------------------------------
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-53.gitc3edfc0.el7.centos.x86_64

Steps to Reproduce:
1. Start Live Removal of Snapshot S1
2. Reset Host, killing the VM
3. Snapshot Removal Fails, one image in chain is illegal
4. Detach SD
5. Attach SD
6. Import VM
7. Chain is now seen in the Engine as healthy (no illegal images)
8. VM starts successfully
9. Same snapshot fails to delete



us': u'Active', 'diskfree': '15569256448', 'isoprefix': '', 'alerts': [], 'disktotal': '21072183296', 'version': 4}}} from=::ffff:10.35.161.8,57268, task_id=6cf396fc-9988-4c4b-970e-446165613e10 (api:52)
2017-11-29 22:13:12,267+0200 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StoragePool.getInfo succeeded in 0.27 seconds (__init__:573)
2017-11-29 22:13:12,331+0200 ERROR (jsonrpc/3) [virt.vm] (vmId='8bcc5076-1fd9-4f7b-91a8-27c947513ffd') Live merge failed (job: 6269d03b-2ecf-4417-86d0-442839d21a19) (vm:5646)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5644, in merge
    bandwidth, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 126, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 512, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 678, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-commit': Could not reopen file: Permission denied
2017-11-29 22:13:12,375+0200 INFO  (jsonrpc/3) [api.virt] FINISH merge return={'status': {'message': 'Merge failed', 'code': 52}} from=::ffff:10.35.161.8,57262, flow_id=cc44bac5-068b-4778-9112-d42997850b70 (api:
52)


Moving to ASSIGNED

Comment 8 Kevin Alon Goldblatt 2017-11-29 20:56:29 UTC
Created attachment 1360540 [details]
vdsm, server, engine logs, supervdsm logs

Adding logs

Comment 9 Ala Hino 2017-11-29 21:08:51 UTC
Not sure what you did, but you are hitting a live merge bug - BZ 1509675.
If it is a recovery using cold merge, I don't understand how you are hitting a live merge issue.

Based on the steps you provided, after step #7 you should merge the snapshot while the VM is down. If you start the VM again, that is a live merge, not a recovery using cold merge.

Moving back to ON_QA.

Comment 10 Ala Hino 2017-11-29 21:17:06 UTC
Please try to understand the original steps provided.
Again, most important is to make sure it is possible to run the VM after live merge fails, i.e. no volumes are in an illegal state on the storage.

See the steps in the description:
1. Start Live Removal of Snapshot S1
2. Reset Host, killing the VM
3. Snapshot Removal Fails, one image in chain is illegal
4. Detach SD
5. Attach SD
6. Import VM
7. Chain is now seen in the Engine as healthy (no illegal images)
8. VM fails to start (illegal image in chain)
9. Same snapshot fails to remove (retry flow)

So before the fix, after live merge fails, the VM is down and cannot be started.
Now in your case, you were able to start the VM, which is great; however, the VM is up now and you cannot test recovery using a cold merge. That's why I suggested that you try a cold merge to see if it helps recover the live merge that failed.

Comment 13 Raz Tamir 2017-11-30 10:00:40 UTC
Removing the needinfo

Comment 14 Kevin Alon Goldblatt 2017-11-30 17:00:29 UTC
Verified with the following code:
-----------------------------------
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-53.gitc3edfc0.el7.centos.x86_64


Ran the scenario as per comment 12:
-----------------------------------

1. create a VM
2. create snapshot(s)
3. write data
4. merge the active layer
5. shutdown the VM during merge
6. merge the same snapshot (while the VM is DOWN) - works fine

The live merge issue is covered by BZ 1509675.


Moving to verified!

Comment 15 RHV bug bot 2017-12-06 16:16:58 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops@redhat.com

Comment 16 RHV bug bot 2017-12-12 21:15:33 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops@redhat.com

Comment 17 Allon Mureinik 2017-12-13 16:58:15 UTC
Added a gerrit patch to appease our bot overlords.

Comment 20 errata-xmlrpc 2018-05-15 17:38:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 21 Franta Kust 2019-05-16 13:08:00 UTC
BZ<2>Jira Resync

