1301083 – Live merge fails with "Failed to live merge, still in volume chain"

Bug 1301083 - Live merge fails with "Failed to live merge, still in volume chain"

Summary: Live merge fails with "Failed to live merge, still in volume chain"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.5.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	ovirt-4.0.0-beta
Target Release:	4.0.0
Assignee:	Ala Hino
QA Contact:	Kevin Alon Goldblatt
Docs Contact:
URL:
Whiteboard:
Depends On:	1247379
Blocks:	1333302 1333304
TreeView+	depends on / blocked

Reported:	2016-01-22 14:37 UTC by Gordon Watson
Modified:	2019-11-14 07:21 UTC (History)
CC List:	18 users (show)
Fixed In Version:	ovirt-engine-4.0.0_beta1
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1333302 1333304 (view as bug list)
Environment:
Last Closed:	2016-08-23 20:31:32 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	2248031	None	None	None	2016-04-18 00:26:59 UTC
Red Hat Product Errata	RHEA-2016:1743	normal	SHIPPED_LIVE	Red Hat Virtualization Manager 4.0 GA Enhancement (ovirt-engine)	2016-09-02 21:54:01 UTC
oVirt gerrit	56472	master	MERGED	core: Live Merge: Improve Live Merge Recovery Mechanism	2020-03-09 18:24:46 UTC
oVirt gerrit	56534	master	ABANDONED	core: Live Merge: Validate snapshot legality	2020-03-09 18:24:46 UTC
oVirt gerrit	56734	ovirt-engine-3.6	MERGED	core: Live Merge: Improve Live Merge Recovery Mechanism	2020-03-09 18:24:46 UTC

Description Gordon Watson 2016-01-22 14:37:37 UTC

Description of problem:

A live merge failed with "Failed to live merge, still in volume chain", however I have not been able to find the reason why.

The engine log just says "Failed to live merge, still in volume chain".

The vdsm log indicates that the merge completed and the subsequent image synchronization completed. No errors were reported in these sequences. It does indicate that the image that should have been deleted is still in the chain.

There are no libvirtd logs,

The qemu log for the VM doesn't indicate any error/message.


Version-Release number of selected component (if applicable):

RHEV 3.5.1
RHEV-H 7.1 20150512.1.el7ev
  vdsm-4.16.13.1-1.el7
  libvirt-1.2.8-16.el7_1.2
  qemu-kvm-rhev-2.1.2-23.el7_1.3


How reproducible:

Not respducible, however the customer has encountered multiple live merge failures subsequently.


Steps to Reproduce:
1.
2.
3.


Actual results:

Live merge with a specific error, but no indication in the logs of how to determine why.


Expected results:

Appropriate errors messages logged.


Additional info:

Comment 10 Allon Mureinik 2016-03-28 14:06:18 UTC

There are a bunch of fixes since 3.5.1 to 3.5.8, so the issue, if it still exists, should be dramatically mitigated.

Pushing out to 3.6.6 as to not risk 3.6.5

Comment 15 Aharon Canan 2016-04-12 12:05:52 UTC

Ala, 

Do we know how to reproduce/verify? 

Aharon.

Comment 19 Allon Mureinik 2016-05-01 19:13:11 UTC

Ala, patch https://gerrit.ovirt.org/#/c/56734/ on the 3.6 branch seems to address this. Are we waiting for anything else? If so - please explain what. If not, please take this BZ through the cloning process

Comment 20 Yaniv Kaul 2016-05-04 18:30:36 UTC

(In reply to Allon Mureinik from comment #19)
> Ala, patch https://gerrit.ovirt.org/#/c/56734/ on the 3.6 branch seems to
> address this. Are we waiting for anything else? If so - please explain what.
> If not, please take this BZ through the cloning process

The target milestone is still 3.6.6 - not sure what the status is.

Comment 21 Ala Hino 2016-05-04 19:22:10 UTC

This BZ is fixed.

Currently, we are not able to clone this BZ due to an infra issue. Will update BZ info once issue resolved.

Comment 22 Yaniv Lavi 2016-05-05 08:46:46 UTC

(In reply to Ala Hino from comment #21)
> This BZ is fixed.
> 
> Currently, we are not able to clone this BZ due to an infra issue. Will
> update BZ info once issue resolved.

Then do it manually.

Comment 23 Tal Nisan 2016-05-05 09:08:25 UTC

This bug has been cloned to the z-stream bug #1333304

Comment 24 Kevin Alon Goldblatt 2016-06-05 21:15:58 UTC

Test the with the following code:
------------------------------------------
rhevm-4.0.0.2-0.1.el7ev.noarch
vdsm-4.18.1-11.gita92976e.el7ev.x86_64


Tested with the following scenario:
-------------------------------------
1.Created Vm with 3 disks, 1 iscsi - preallocated, 1 iscsi - thin and 1 nfs - thin
2. Created snapshots sbu1, sbu2 and sbu3 while VM is dowm
3. Started the VM and created su1, su2 and su3
4. Selected snbu2 and tried deleting >>>>> The operation failed with "Failed to delete snapshot 'sbu2' for VM 'vm1'."
vdsm log displays error Volume  does not exist"






Engine.log:
-----------------
2016-06-05 14:11:54,499 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-5) [] Correlation ID: 69d26cca, Job ID: d1762475-b0e4-44c6-9d15-4d36f2b1a4e0, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'sbu2' deletion for VM 'vm1' was initiated by admin@internal-authz.
.
.
.
2016-06-05 14:14:09,898 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (DefaultQuartzScheduler2) [6067c86] START, GetImageInfoVDSCommand( GetImageInfoVDSCommandPar
ameters:{runAsync='true', storagePoolId='3f76ceea-84c5-4893-8c7d-3bf2d8eea69e', ignoreFailoverLimit='false', storageDomainId='7590d3ff-797a-4042-90f8-4094e6d01665', imageGroupId='181ba273-4
150-4ad0-a209-95184cd1fbd4', imageId='3c27fae4-6ac7-436e-8302-e165632e5108'}), log id: 135a9ad8
2016-06-05 14:14:10,928 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (DefaultQuartzScheduler2) [6067c86] FINISH, GetImageInfoVDSCommand, return: org.ovirt.engine
.core.common.businessentities.storage.DiskImage@7786d8dc, log id: 135a9ad8
2016-06-05 14:14:11,206 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (DefaultQuartzScheduler2) [6067c86] transaction rolled back
2016-06-05 14:14:11,206 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (DefaultQuartzScheduler2) [6067c86] transaction rolled back
2016-06-05 14:14:11,206 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (DefaultQuartzScheduler2) [6067c86] Error invoking callback method 'onFailed' for 'FAILED' command 'c2
b114ea-5b53-4686-86c1-175c97ad19a8'
2016-06-05 14:14:11,206 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (DefaultQuartzScheduler2) [6067c86] Exception: java.lang.NullPointerException
        at org.ovirt.engine.core.utils.ovf.OvfWriter.buildDisk(OvfWriter.java:204) [utils.jar:]
        at org.ovirt.engine.core.utils.ovf.OvfManager.buildOvf(OvfManager.java:118) [utils.jar:]
        at org.ovirt.engine.core.utils.ovf.OvfManager.exportVm(OvfManager.java:29) [utils.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.ImagesHandler.prepareSnapshotConfigWithAlternateImage(ImagesHandler.java:829) [bll.jar:]
        at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.lambda$updateVmConfigurationForImageChange$1(RemoveSnapshotSingleDiskLiveCommand.java:394) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:204) [utils.jar:]
        at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.updateVmConfigurationForImageChange(RemoveSnapshotSingleDiskLiveCommand.java:392) [bll.jar:]
        at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.handleBackwardLiveMerge(RemoveSnapshotSingleDiskLiveCommand.java:382) [bll.jar:]
        at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.lambda$syncDbRecords$0(RemoveSnapshotSingleDiskLiveCommand.java:303) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:204) [utils.jar:]
        at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.syncDbRecords(RemoveSnapshotSingleDiskLiveCommand.java:291) [bll.jar:]
        at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.onFailed(RemoveSnapshotSingleDiskLiveCommand.java:447) [bll.jar:]
        at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback.onFailed(RemoveSnapshotSingleDiskLiveCommandCallback.java:18) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.endCallback(CommandCallbacksPoller.java:56) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109) [bll.jar:]
        at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source) [:1.8.0_91]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_91]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_91]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_91]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_91]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_91]

2016-06-05 14:14:12,249 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler8) [6067c86] Command 'RemoveSnapshot' id: 'fec75b49-40a7-4439-8dde-
d9db3bb221b2' child commands '[b77126ce-f696-4bdc-a0f6-b411841c5ad4, 24b9ea03-2be1-4f0a-a1ab-3ba54c209f58, c2b114ea-5b53-4686-86c1-175c97ad19a8]' executions were completed, status 'FAILED'
2016-06-05 14:14:13,520 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (DefaultQuartzScheduler10) [6067c86] Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnap
shotCommand' with failure.
2016-06-05 14:14:13,724 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [6067c86] Correlation ID: 69d26cca, Job ID: d1762475-b0e4-44c6-9d15-4d36f2b1a4e0, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'sbu2' for VM 'vm1'.




VDSM log:
--------------------------------
Thread-430::DEBUG::2016-06-05 14:13:30,758::resourceManager::241::Storage.ResourceManager.Request::(grant) ResName=`Storage.7590d3ff-797a-4042-90f8-4094e6d01665`ReqID=`8e5f76d1-4683-46d5-b823-43ad815e8165`::Granted request
Thread-430::DEBUG::2016-06-05 14:13:30,758::task::829::Storage.TaskManager.Task::(resourceAcquired) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::_resourcesAcquired: Storage.7590d3ff-797a-4042-90f8-4094e6d01665 (shared)
Thread-430::DEBUG::2016-06-05 14:13:30,758::task::995::Storage.TaskManager.Task::(_decref) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::ref 1 aborting False
Thread-430::DEBUG::2016-06-05 14:13:30,759::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /usr/sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3514f0c5a516000c8|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 7590d3ff-797a-4042-90f8-4094e6d01665 (cwd None)
Thread-430::DEBUG::2016-06-05 14:13:30,961::lvm::288::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0
Thread-430::DEBUG::2016-06-05 14:13:30,963::lvm::463::Storage.LVM::(_reloadlvs) lvs reloaded
Thread-430::WARNING::2016-06-05 14:13:30,963::lvm::609::Storage.LVM::(getLv) lv: 9f680fdb-fe8a-462c-9977-7aec183ec6af not found in lvs vg: 7590d3ff-797a-4042-90f8-4094e6d01665 response
Thread-430::ERROR::2016-06-05 14:13:30,963::task::868::Storage.TaskManager.Task::(_setError) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 875, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3039, in getVolumeInfo
    volUUID=volUUID).getInfo()
  File "/usr/share/vdsm/storage/sd.py", line 619, in produceVolume
    volUUID)
  File "/usr/share/vdsm/storage/blockVolume.py", line 370, in __init__
    manifest = self.manifestClass(repoPath, sdUUID, imgUUID, volUUID)
  File "/usr/share/vdsm/storage/blockVolume.py", line 72, in __init__
    volUUID)
  File "/usr/share/vdsm/storage/volume.py", line 80, in __init__
    self.validate()
  File "/usr/share/vdsm/storage/blockVolume.py", line 156, in validate
    raise se.VolumeDoesNotExist(self.volUUID)
VolumeDoesNotExist: Volume does not exist: ('9f680fdb-fe8a-462c-9977-7aec183ec6af',)
Thread-430::DEBUG::2016-06-05 14:13:30,989::task::887::Storage.TaskManager.Task::(_run) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::Task._run: 18cc50f9-e374-46bb-9389-e48f192628c8 ('7590d3ff-797a-4042-90f8-4094e6d01665', '3f76ceea-84c5-4893-8c7d-3bf2d8eea69e', '9c3f4477-4041-4c4f-8add-b547db25c839', '9f680fdb-fe8a-462c-9977-7aec183ec6af') {} failed - stopping task
Thread-430::DEBUG::2016-06-05 14:13:30,989::task::1248::Storage.TaskManager.Task::(stop) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::stopping in state preparing (force False)
Thread-430::DEBUG::2016-06-05 14:13:30,989::task::995::Storage.TaskManager.Task::(_decref) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::ref 1 aborting True
Thread-430::INFO::2016-06-05 14:13:30,989::task::1173::Storage.TaskManager.Task::(prepare) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::aborting: Task is aborted: 'Volume does not exist' - code 201
Thread-430::DEBUG::2016-06-05 14:13:30,990::task::1178::Storage.TaskManager.Task::(prepare) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::Prepare: aborted: Volume does not exist
Thread-430::DEBUG::2016-06-05 14:13:30,990::task::995::Storage.TaskManager.Task::(_decref) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::ref 0 aborting True
Thread-430::DEBUG::2016-06-05 14:13:30,990::task::930::Storage.TaskManager.Task::(_doAbort) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::Task._doAbort: force False
Thread-430::DEBUG::2016-06-05 14:13:30,990::resourceManager::989::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-430::DEBUG::2016-06-05 14:13:30,990::task::597::Storage.TaskManager.Task::(_updateState) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::moving from state preparing -> state aborting
Thread-430::DEBUG::2016-06-05 14:13:30,990::task::552::Storage.TaskManager.Task::(__state_aborting) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::_aborting: recover policy none
Thread-430::DEBUG::2016-06-05 14:13:30,990::task::597::Storage.TaskManager.Task::(_updateState) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::moving from state aborting -> state failed
Thread-430::DEBUG::2016-06-05 14:13:30,991::resourceManager::952::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'Storage.7590d3ff-797a-4042-90f8-4094e6d01665': < ResourceRef 'Storage.7590d3ff-797a-4042-90f8-4094e6d01665', isValid: 'True' obj: 'None'>}
Thread-430::DEBUG::2016-06-05 14:13:30,991::resourceManager::989::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}

Comment 25 Kevin Alon Goldblatt 2016-06-15 17:17:25 UTC

Tested with the following code:
------------------------------------------
rhevm-4.0.0.4-0.1.el7ev.noarch
vdsm-4.18.2-0.el7ev.x86_64




Tested with the following scenario:
-------------------------------------
1.Created Vm with 3 disks, 1 iscsi - preallocated, 1 iscsi - thin and 1 nfs - thin
2. Created snapshots sd1, sd2, sd3 and sd4 while VM is dowm
3. Deleted sd2 successfully
4. Deleted sd3 successfully
5. Started the VM
6. Created snapshot su1
7. Created snapshot su2
8. Created snapshot su3
9. Created snapshot su4
10. Deleted snapshot sd1 successfully
11. Deleted snapshot su2 successfully

Moving to VERIFIED!

Comment 27 errata-xmlrpc 2016-08-23 20:31:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html

Note You need to log in before you can comment on or make changes to this bug.