Bug 1384321

Summary:

Allow using cold merge to recover a failed live/cold merge

Product:

Red Hat Enterprise Virtualization Manager

Reporter:

Germano Veit Michel <gveitmic>

Component:

ovirt-engine

Assignee:

Ala Hino <ahino>

Status:

CLOSED ERRATA

QA Contact:

Kevin Alon Goldblatt <kgoldbla>

Severity:

high

Docs Contact:

Priority:

high

Version:

4.0.3

CC:

apinnick, eshames, gwatson, lsurette, ratamir, rbalakri, Rhev-m-bugs, srevivo, tnisan, ykaul, ylavi

Target Milestone:

ovirt-4.2.0

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Previously, if a virtual machine was shut down during a live merge, an illegal snapshot disk was created, the live merge failed, and the virtual machine did not start up. In the current release, the virtual machine can be recovered with a cold merge.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-05-15 17:38:43 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Storage

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1383301, 1496399

Bug Blocks:

Attachments:

Description	Flags
vdsm, server, engine logs, supervdsm logs	none

Description Germano Veit Michel 2016-10-13 06:23:06 UTC

Description of problem:

We cannot allow a SD to be detached while there are image operations that need a recovery flow.

Several Snapshot Removal flows, when interrupted can lead to Illegal Images in the SD. I believe there are also cases where the generated OVFs may not reflect the current chain in the SD, so the VM import may fail looking for images that don't exist or the wrong chain, or even import fine using images that were already merged, outdated, and should be deleted. There are likely literally dozens of cases where things can go wrong and we fail to remove a snapshot, which needs to be retried.

The OVF does NOT carry all the image bits from the DB. It only carries the chain and does not contain ANY information if a snapshot removal needs to be retried and/or go to recovery flow (i.e.: no illegal info)

Version-Release number of selected component (if applicable):

How reproducible:
100% - several ways

Steps to Reproduce:
1. Start Live Removal of Snapshot S1
2. Reset Host, killing the VM
3. Snapshot Removal Fails, one image in chain is illegal
4. Detach SD
5. Attach SD
6. Import VM
7. Chain is now seen in the Engine as healthy (no illegal images)
8. VM fails to start (illegal image in chain)
9. Same snapshot fail to remove (retry flow 

Logs:

2016-10-13 01:57:06,717 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [3cd2bd32] Correlation ID: 3cd2bd32, Job ID: 8ca792cb-c249-4241-ab02-fbf7d4036760, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'S1' for VM 'CentOS7'.

2016-10-13 02:01:40,719 INFO  [org.ovirt.engine.core.utils.archivers.tar.TarInMemoryExport] (org.ovirt.thread.pool-6-thread-43) [269a8b50] Retrieve OVF Entity from storage domain ID 'bfed9c9a-30c3-45d9-afaf-9e695ad9e2f5' for entity ID '4f1d68ed-05a2-493c-a849-45f029ffb3b7', entity name 'CentOS7' and VM Type of 'VM'

2016-10-13 02:01:41,053 INFO  [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-43) [560fe0a5] Adding OVF data of entity id '4f1d68ed-05a2-493c-a849-45f029ffb3b7' and entity name 'CentOS7'

Job ID: 82460d6d-1843-48f4-bb41-6023ac16d714, Call Stack: null, Custom Event ID: -1, Message: VM CentOS7 was started by admin@internal-authz (Host: RHVH-20).
2016-10-13 02:02:30,086 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-5) [] VM '4f1d68ed-05a2-493c-a849-45f029ffb3b7'(CentOS7) moved from 'WaitForLaunch' --> 'Down'

2016-10-13 02:02:30,122 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-5) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM CentOS7 is down with error. Exit message: Bad volume specification {u'index': 0, u'domainID': u'bfed9c9a-30c3-45d9-afaf-9e695ad9e2f5', 'reqsize': '0', u'format': u'cow', u'bootOrder': u'1', u'address': {u'function': u'0x0', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'slot': u'0x06'}, u'volumeID': u'6c1749d9-86f8-431a-bb0c-d8f138ab6dc4', 'apparentsize': '1073741824', u'imageID': u'6aa3112f-eb2f-46d1-a9a6-23a4123a8d8f', u'specParams': {}, u'readonly': u'false', u'iface': u'virtio', u'optional': u'false', u'deviceId': u'6aa3112f-eb2f-46d1-a9a6-23a4123a8d8f', 'truesize': '1073741824', u'poolID': u'00000001-0001-0001-0001-0000000002f3', u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'type': u'disk'}.

2016-10-13 02:02:32,164 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-14) [] add VM '4f1d68ed-05a2-493c-a849-45f029ffb3b7'(CentOS7) to rerun treatment

2016-10-13 02:02:32,260 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-47) [] Correlation ID: 469132a9, Job ID: 82460d6d-1843-48f4-bb41-6023ac16d714, Call Stack: null, Custom Event ID: -1, Message: Failed to run VM CentOS7 on Host RHVH-10.

2016-10-13 02:08:18,828 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-24) [7572b924] Correlation ID: 40bd694f, Job ID: 593b13cc-10f8-489e-bef6-d1def55cd7ea, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'S1' deletion for VM 'CentOS7' was initiated by admin@internal-authz.

2016-10-13 02:08:22,436 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-12) [40bd694f] Correlation ID: 40bd694f, Job ID: 593b13cc-10f8-489e-bef6-d1def55cd7ea, Call Stack: null, Custom Event ID: -1, Message: Due to partial snapshot removal, Snapshot 'S1' of VM 'CentOS7' now contains only the following disks: 'CentOS7_Disk1'.

2016-10-13 02:08:22,493 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-12) [] Correlation ID: 40bd694f, Job ID: 593b13cc-10f8-489e-bef6-d1def55cd7ea, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'S1' for VM 'CentOS7'.

And try again:

2016-10-13 02:19:43,407 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-24) [3043e221] Correlation ID: 7748262d, Job ID: 9a0d2980-42c2-4306-9d7c-f3a9c1280fc5, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'S1' deletion for VM 'CentOS7' was initiated by admin@internal-authz.

2016-10-13 02:19:44,788 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-7) [7748262d] Correlation ID: 7748262d, Job ID: 9a0d2980-42c2-4306-9d7c-f3a9c1280fc5, Call Stack: null, Custom Event ID: -1, Message: Due to partial snapshot removal, Snapshot 'S1' of VM 'CentOS7' now contains only the following disks: 'CentOS7_Disk1'.

2016-10-13 02:19:44,851 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-7) [] Correlation ID: 7748262d, Job ID: 9a0d2980-42c2-4306-9d7c-f3a9c1280fc5, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'S1' for VM 'CentOS7'.

Actual results:
VM is unusable, no idea on how to recover it via GUI. Probably needs manual intervention to recover.

Expected results:
Do not allow to detach SD

See also:
BZ 1372163 - [RFE] Warn user about VMs that have pending snapshot removal retries

Comment 1 Germano Veit Michel 2016-10-13 06:29:55 UTC

Sorry, forgot this:

Version-Release number of selected component (if applicable):
ovirt-engine-4.0.4.4-0.1.el7ev.noarch
vdsm-4.18.11-1.el7ev.x86_64

Comment 2 Allon Mureinik 2017-03-19 16:26:30 UTC

We can't block detaching if a merge failed.
A cleaner approach would be to allow the user to recover (=retry) the failed merge post-attach.

Comment 3 Ala Hino 2017-10-26 12:39:26 UTC

After the BZs that this bug depend on, I'd expect a recovery from any failure due to live merge, either by retrying live merge or cold megre.
I'd like to list again the live merge stepos that are performed by the engine, and describe how recpovery should work:

Extend
Merge
-----
Merge Status
Delete Volume
Verify Volume Deletion
-----
Reduce Base Volume

If live merge fails at Extend or Merge steps, the user should be able to retry live merge or, if the Vm shutdown, cold merge should recover the live megre failure.
Here, Merge could fail due to different reasons, for example, job aborted at libvirt, etc.

If live merge fails at Merge Status, Delete Volume or Verify Volume Deletion, the engine will keep retrying executing these commands until succeeds. This is a reasonable behavior, because the failure in this steps could be due temporary network exceptions or a host that went non-responsive.

Any failure at Reduce Volume step should not fail the live merge operation.

Comment 4 Ala Hino 2017-10-29 12:56:40 UTC

Raz,

This bug should be resolved now.
It is a RHV bug so it is up to you whether to wait for RHV bug in order to verify.

Comment 5 rhev-integ 2017-11-02 13:39:58 UTC

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops

Comment 6 rhev-integ 2017-11-02 21:09:14 UTC

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops

Comment 7 Kevin Alon Goldblatt 2017-11-29 20:55:10 UTC

Verified with the following code:
-----------------------------------
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-53.gitc3edfc0.el7.centos.x86_64

Steps to Reproduce:
1. Start Live Removal of Snapshot S1
2. Reset Host, killing the VM
3. Snapshot Removal Fails, one image in chain is illegal
4. Detach SD
5. Attach SD
6. Import VM
7. Chain is now seen in the Engine as healthy (no illegal images)
8. VM starts successfully
9. Same snapshot is fails to delete



us': u'Active', 'diskfree': '15569256448', 'isoprefix': '', 'alerts': [], 'disktotal': '21072183296', 'version': 4}}} from=::ffff:10.35.161.8,57268, task_id=6cf396fc-9988-4c4b-970e-446165613e10 (api:52)
2017-11-29 22:13:12,267+0200 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StoragePool.getInfo succeeded in 0.27 seconds (__init__:573)
2017-11-29 22:13:12,331+0200 ERROR (jsonrpc/3) [virt.vm] (vmId='8bcc5076-1fd9-4f7b-91a8-27c947513ffd') Live merge failed (job: 6269d03b-2ecf-4417-86d0-442839d21a19) (vm:5646)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5644, in merge
    bandwidth, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 126, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 512, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 678, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-commit': Could not reopen file: Permission denied
2017-11-29 22:13:12,375+0200 INFO  (jsonrpc/3) [api.virt] FINISH merge return={'status': {'message': 'Merge failed', 'code': 52}} from=::ffff:10.35.161.8,57262, flow_id=cc44bac5-068b-4778-9112-d42997850b70 (api:
52)


Moving to ASSIGN

Comment 8 Kevin Alon Goldblatt 2017-11-29 20:56:29 UTC

Created attachment 1360540 [details]
vdsm, server, engine logs, supervdsm logs

Adding logs

Comment 9 Ala Hino 2017-11-29 21:08:51 UTC

Not sure what you did but you are hitting a live merge bug - BZ 1509675.
If it is a recovery using cold merge, not sure I understand how you are hitting a live merge issue.

Based on the steps you provided, after step #7 you should merge the snapshot while the VM is down. If starting the VM again, this is live merge and a recovery using cold merge.

Moving back to ON_QA.

Comment 10 Ala Hino 2017-11-29 21:17:06 UTC

Please try to understand the original steps provided.
Again, most important is to make sure it is possible to run the VM after live merge fails, i.e. no volumes are in illegal state at the storage.

See the steps in the description:
1. Start Live Removal of Snapshot S1
2. Reset Host, killing the VM
3. Snapshot Removal Fails, one image in chain is illegal
4. Detach Sow D
5. Attach SD
6. Import VM
7. Chain is now seen in the Engine as healthy (no illegal images)
8. VM fails to start (illegal image in chain)
9. Same snapshot fail to remove (retry flow

So before the fix, after live merge fails, the VM is down and cannot be started.
Now in your case, you were able to start the VM, which is great; however, the VM is up now and you cannot test recovery using cold merge. That's why I suggested that you try cold merge to see if helps recovering the live merge that failed.

Comment 13 Raz Tamir 2017-11-30 10:00:40 UTC

Removing the needinfo

Comment 14 Kevin Alon Goldblatt 2017-11-30 17:00:29 UTC

Verified with the following code:
-----------------------------------
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-53.gitc3edfc0.el7.centos.x86_64


Ran the scenario as per comment 12:
-----------------------------------

1. create a VM
2. create snapshot(s)
3. write data
4. merge the active layer
5. shutdown the VM during merge
6. merge the same snapshot (while the VM is DOWN) - works fine

The live merge issue is covered by  BZ 1509675.


Moving to verified!

Comment 15 RHV bug bot 2017-12-06 16:16:58 UTC

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops

Comment 16 RHV bug bot 2017-12-12 21:15:33 UTC

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops

Comment 17 Allon Mureinik 2017-12-13 16:58:15 UTC

Added a gerrit patch to appease our bot overlords.

Comment 20 errata-xmlrpc 2018-05-15 17:38:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 21 Franta Kust 2019-05-16 13:08:00 UTC

BZ<2>Jira Resync