Hide Forgot
Description of problem: A Live Merge timed out on the engine, but continued and succeeded on the host. This left the VM's volumes in an inconsistent state and required extensive manually cleanup. A subsequent attempt to live merge the same snaphot failed with "Drive image file could not be found" reported by the engine. Version-Release number of selected component (if applicable): RHEV 3.5.6 RHEL 7.2 hosts w/vdsm-4.16.32-1 How reproducible: Not. Steps to Reproduce: 1. 2. 3. Actual results: - Live merge timed out on the engine, e.g. 2016-02-09 17:36:39,783 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-4) [433adeb7] Command MergeVDSCommand(HostName = host2, MergeVDSCommandParameters{HostId = 8064fb65-939e-4675-aa30-764f4e885e9a, vmId=482aa8aa-aaa0-45ee-ac87-09dbb1fc3b26, storagePoolId=dffa92b5-0826-48f1-98a0-c6bfe1cf169c, storageDomainId=2c565c22-1e20-4c07-8457-51726763471a, imageGroupId=31b76a05-3b45-46ac-bed0-9a78fb7d12d2, imageId=9d69d818-c1a0-4f39-adee-bb2320b33dc9, baseImageId=3658e7a1-61ef-437e-82bd-29b20d3531e6, topImageId=9d69d818-c1a0-4f39-adee-bb2320b33dc9, bandwidth=0}) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues 2016-02-09 17:36:59,762 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-73) [616edc76] Merging of snapshot 62306512-ed74-4f40-a67a-dca1de382a00 images 3658e7a1-61ef-437e-82bd-29b20d3531e6..9d69d818-c1a0-4f39-adee-bb2320b33dc9 failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation. - And succeeded on the host; jsonrpc.Executor-worker-6::DEBUG::2016-02-09 17:33:39,295::__init__::481::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'9d69d818-c1a0-4f39-adee-bb2320b33dc9', u'vmID': u'482aa8aa-aaa0-45ee-ac87-09dbb1fc3b26', u'drive': {u'domainID': u'2c565c22-1e20-4c07-8457-51726763471a', u'volumeID': u'9d69d818-c1a0-4f39-adee-bb2320b33dc9', u'poolID': u'dffa92b5-0826-48f1-98a0-c6bfe1cf169c', u'imageID': u'31b76a05-3b45-46ac-bed0-9a78fb7d12d2'}, u'bandwidth': u'0', u'jobUUID': u'7ee66159-cedd-40b3-b771-33cbde950acc', u'baseVolUUID': u'3658e7a1-61ef-437e-82bd-29b20d3531e6'} jsonrpc.Executor-worker-6::DEBUG::2016-02-09 17:48:43,085::__init__::514::jsonrpc.JsonRpcServer::(_serveRequest) Return 'VM.merge' in bridge with True Thread-30532::INFO::2016-02-09 17:48:58,933::vm::6235::vm.Vm::(tryPivot) vmId=`482aa8aa-aaa0-45ee-ac87-09dbb1fc3b26`::Requesting pivot to complete active layer commit (job 7ee66159-cedd-40b3-b771-33cbde950acc) Thread-30532::INFO::2016-02-09 17:48:58,967::vm::6303::vm.Vm::(_waitForXMLUpdate) vmId=`482aa8aa-aaa0-45ee-ac87-09dbb1fc3b26`::Waiting for libvirt to update the XML after pivot of drive virtio-disk1 completed Thread-30532::INFO::2016-02-09 17:48:58,984::vm::6248::vm.Vm::(tryPivot) vmId=`482aa8aa-aaa0-45ee-ac87-09dbb1fc3b26`::Pivot completed (job 7ee66159-cedd-40b3-b771-33cbde950acc) Thread-30532::DEBUG::2016-02-09 17:48:59,004::vm::6109::vm.Vm::(_syncVolumeChain) vmId=`482aa8aa-aaa0-45ee-ac87-09dbb1fc3b26`::vdsm chain: ['3658e7a1-61ef-437e-82bd-29b20d3531e6', '9d69d818-c1a0-4f39-adee-bb2320b33dc9'], libvirt chain: ['3658e7a1-61ef-437e-82bd-29b20d3531e6'] - The database still contained both images, one now marked as "illegal". - Both images still physically existed in the storage domain. - The volume metadata for the merged image was marked as "ILLEGAL". - The volume metadata for the parent (in this case, base) image still contained "VOLTYPE=INTERNAL". - A subsequent live merge failed; 2016-02-09 18:05:30,241 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-1) [4dcbde22] Command MergeVDSCommand(HostName = host-2, MergeVDSCommandParameters{HostId = 8064fb65-939e-4675-aa30-764f4e885e9a, vmId=482aa8aa-aaa0-45ee-ac87-09dbb1fc3b26, storagePoolId=dffa92b5-0826-48f1-98a0-c6bfe1cf169c, storageDomainId=2c565c22-1e20-4c07-8457-51726763471a, imageGroupId=31b76a05-3b45-46ac-bed0-9a78fb7d12d2, imageId=9d69d818-c1a0-4f39-adee-bb2320b33dc9, baseImageId=3658e7a1-61ef-437e-82bd-29b20d3531e6, topImageId=9d69d818-c1a0-4f39-adee-bb2320b33dc9, bandwidth=0}) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Drive image file could not be found, code = 13 2016-02-09 18:05:50,273 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-25) [73063f72] Merging of snapshot 62306512-ed74-4f40-a67a-dca1de382a00 images 3658e7a1-61ef-437e-82bd-29b20d3531e6..9d69d818-c1a0-4f39-adee-bb2320b33dc9 failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation. Expected results: Either a susbsequent live merge should be prevented with a much clearer message indicating that the disk volumes my now be inconsistent and need immediate attention, and/or some some "recovery" could be attempted. Obviously the tricky part would be how to do this. Maybe a susbsequent query to the host to determine the current image chain and then act accordingly. But then the metadata could also be incorrect and automating that may not be feasible. Additional info: Details to follow.
Pushing out to 3.6.6 as to not risk 3.6.5
Ala, patch https://gerrit.ovirt.org/#/c/56734/ on the 3.6 branch seems to address this. Are we waiting for anything else? If so - please explain what. If not, please take this BZ through the cloning process
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.
Something seems to have gone wrong with the cloning, this is MODIFIED on 4.0.
Actually, it should be ON_QA - it's included in the first beta build.
*** Bug 1342681 has been marked as a duplicate of this bug. ***
Tested with the following code: ------------------------------------------ rhevm-4.0.0.2-0.1.el7ev.noarch vdsm-4.18.1-11.gita92976e.el7ev.x86_64 Tested with the following scenario: ------------------------------------- 1.Created Vm with 3 disks, 1 iscsi - preallocated, 1 iscsi - thin and 1 nfs - thin 2. Created snapshots sbu1, sbu2 and sbu3 while VM is dowm 3. Started the VM and created su1, su2 and su3 4. Selected snbu2 and tried deleting >>>>> The operation failed with "Failed to delete snapshot 'sbu2' for VM 'vm1'." vdsm log displays error Volume does not exist" Engine.log: ----------------- 2016-06-05 14:11:54,499 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-5) [] Correlation ID: 69d26cca, Job ID: d1762475-b0e4-44c6-9d15-4d36f2b1a4e0, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'sbu2' deletion for VM 'vm1' was initiated by admin@internal-authz. . . . 2016-06-05 14:14:09,898 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (DefaultQuartzScheduler2) [6067c86] START, GetImageInfoVDSCommand( GetImageInfoVDSCommandPar ameters:{runAsync='true', storagePoolId='3f76ceea-84c5-4893-8c7d-3bf2d8eea69e', ignoreFailoverLimit='false', storageDomainId='7590d3ff-797a-4042-90f8-4094e6d01665', imageGroupId='181ba273-4 150-4ad0-a209-95184cd1fbd4', imageId='3c27fae4-6ac7-436e-8302-e165632e5108'}), log id: 135a9ad8 2016-06-05 14:14:10,928 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (DefaultQuartzScheduler2) [6067c86] FINISH, GetImageInfoVDSCommand, return: org.ovirt.engine .core.common.businessentities.storage.DiskImage@7786d8dc, log id: 135a9ad8 2016-06-05 14:14:11,206 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (DefaultQuartzScheduler2) [6067c86] transaction rolled back 2016-06-05 14:14:11,206 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (DefaultQuartzScheduler2) [6067c86] transaction rolled back 2016-06-05 14:14:11,206 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (DefaultQuartzScheduler2) [6067c86] Error invoking callback method 'onFailed' for 'FAILED' command 'c2 b114ea-5b53-4686-86c1-175c97ad19a8' 2016-06-05 14:14:11,206 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (DefaultQuartzScheduler2) [6067c86] Exception: java.lang.NullPointerException at org.ovirt.engine.core.utils.ovf.OvfWriter.buildDisk(OvfWriter.java:204) [utils.jar:] at org.ovirt.engine.core.utils.ovf.OvfManager.buildOvf(OvfManager.java:118) [utils.jar:] at org.ovirt.engine.core.utils.ovf.OvfManager.exportVm(OvfManager.java:29) [utils.jar:] at org.ovirt.engine.core.bll.storage.disk.image.ImagesHandler.prepareSnapshotConfigWithAlternateImage(ImagesHandler.java:829) [bll.jar:] at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.lambda$updateVmConfigurationForImageChange$1(RemoveSnapshotSingleDiskLiveCommand.java:394) [bll.jar:] at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:204) [utils.jar:] at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.updateVmConfigurationForImageChange(RemoveSnapshotSingleDiskLiveCommand.java:392) [bll.jar:] at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.handleBackwardLiveMerge(RemoveSnapshotSingleDiskLiveCommand.java:382) [bll.jar:] at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.lambda$syncDbRecords$0(RemoveSnapshotSingleDiskLiveCommand.java:303) [bll.jar:] at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:204) [utils.jar:] at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.syncDbRecords(RemoveSnapshotSingleDiskLiveCommand.java:291) [bll.jar:] at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.onFailed(RemoveSnapshotSingleDiskLiveCommand.java:447) [bll.jar:] at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback.onFailed(RemoveSnapshotSingleDiskLiveCommandCallback.java:18) [bll.jar:] at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.endCallback(CommandCallbacksPoller.java:56) [bll.jar:] at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109) [bll.jar:] at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source) [:1.8.0_91] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_91] at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_91] at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:] at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:] at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_91] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_91] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_91] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_91] at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_91] 2016-06-05 14:14:12,249 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler8) [6067c86] Command 'RemoveSnapshot' id: 'fec75b49-40a7-4439-8dde- d9db3bb221b2' child commands '[b77126ce-f696-4bdc-a0f6-b411841c5ad4, 24b9ea03-2be1-4f0a-a1ab-3ba54c209f58, c2b114ea-5b53-4686-86c1-175c97ad19a8]' executions were completed, status 'FAILED' 2016-06-05 14:14:13,520 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (DefaultQuartzScheduler10) [6067c86] Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnap shotCommand' with failure. 2016-06-05 14:14:13,724 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [6067c86] Correlation ID: 69d26cca, Job ID: d1762475-b0e4-44c6-9d15-4d36f2b1a4e0, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'sbu2' for VM 'vm1'. VDSM log: -------------------------------- Thread-430::DEBUG::2016-06-05 14:13:30,758::resourceManager::241::Storage.ResourceManager.Request::(grant) ResName=`Storage.7590d3ff-797a-4042-90f8-4094e6d01665`ReqID=`8e5f76d1-4683-46d5-b823-43ad815e8165`::Granted request Thread-430::DEBUG::2016-06-05 14:13:30,758::task::829::Storage.TaskManager.Task::(resourceAcquired) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::_resourcesAcquired: Storage.7590d3ff-797a-4042-90f8-4094e6d01665 (shared) Thread-430::DEBUG::2016-06-05 14:13:30,758::task::995::Storage.TaskManager.Task::(_decref) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::ref 1 aborting False Thread-430::DEBUG::2016-06-05 14:13:30,759::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /usr/sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3514f0c5a516000c8|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 7590d3ff-797a-4042-90f8-4094e6d01665 (cwd None) Thread-430::DEBUG::2016-06-05 14:13:30,961::lvm::288::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n'; <rc> = 0 Thread-430::DEBUG::2016-06-05 14:13:30,963::lvm::463::Storage.LVM::(_reloadlvs) lvs reloaded Thread-430::WARNING::2016-06-05 14:13:30,963::lvm::609::Storage.LVM::(getLv) lv: 9f680fdb-fe8a-462c-9977-7aec183ec6af not found in lvs vg: 7590d3ff-797a-4042-90f8-4094e6d01665 response Thread-430::ERROR::2016-06-05 14:13:30,963::task::868::Storage.TaskManager.Task::(_setError) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::Unexpected error Traceback (most recent call last): File "/usr/share/vdsm/storage/task.py", line 875, in _run return fn(*args, **kargs) File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper res = f(*args, **kwargs) File "/usr/share/vdsm/storage/hsm.py", line 3039, in getVolumeInfo volUUID=volUUID).getInfo() File "/usr/share/vdsm/storage/sd.py", line 619, in produceVolume volUUID) File "/usr/share/vdsm/storage/blockVolume.py", line 370, in __init__ manifest = self.manifestClass(repoPath, sdUUID, imgUUID, volUUID) File "/usr/share/vdsm/storage/blockVolume.py", line 72, in __init__ volUUID) File "/usr/share/vdsm/storage/volume.py", line 80, in __init__ self.validate() File "/usr/share/vdsm/storage/blockVolume.py", line 156, in validate raise se.VolumeDoesNotExist(self.volUUID) VolumeDoesNotExist: Volume does not exist: ('9f680fdb-fe8a-462c-9977-7aec183ec6af',) Thread-430::DEBUG::2016-06-05 14:13:30,989::task::887::Storage.TaskManager.Task::(_run) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::Task._run: 18cc50f9-e374-46bb-9389-e48f192628c8 ('7590d3ff-797a-4042-90f8-4094e6d01665', '3f76ceea-84c5-4893-8c7d-3bf2d8eea69e', '9c3f4477-4041-4c4f-8add-b547db25c839', '9f680fdb-fe8a-462c-9977-7aec183ec6af') {} failed - stopping task Thread-430::DEBUG::2016-06-05 14:13:30,989::task::1248::Storage.TaskManager.Task::(stop) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::stopping in state preparing (force False) Thread-430::DEBUG::2016-06-05 14:13:30,989::task::995::Storage.TaskManager.Task::(_decref) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::ref 1 aborting True Thread-430::INFO::2016-06-05 14:13:30,989::task::1173::Storage.TaskManager.Task::(prepare) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::aborting: Task is aborted: 'Volume does not exist' - code 201 Thread-430::DEBUG::2016-06-05 14:13:30,990::task::1178::Storage.TaskManager.Task::(prepare) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::Prepare: aborted: Volume does not exist Thread-430::DEBUG::2016-06-05 14:13:30,990::task::995::Storage.TaskManager.Task::(_decref) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::ref 0 aborting True Thread-430::DEBUG::2016-06-05 14:13:30,990::task::930::Storage.TaskManager.Task::(_doAbort) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::Task._doAbort: force False Thread-430::DEBUG::2016-06-05 14:13:30,990::resourceManager::989::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {} Thread-430::DEBUG::2016-06-05 14:13:30,990::task::597::Storage.TaskManager.Task::(_updateState) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::moving from state preparing -> state aborting Thread-430::DEBUG::2016-06-05 14:13:30,990::task::552::Storage.TaskManager.Task::(__state_aborting) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::_aborting: recover policy none Thread-430::DEBUG::2016-06-05 14:13:30,990::task::597::Storage.TaskManager.Task::(_updateState) Task=`18cc50f9-e374-46bb-9389-e48f192628c8`::moving from state aborting -> state failed Thread-430::DEBUG::2016-06-05 14:13:30,991::resourceManager::952::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'Storage.7590d3ff-797a-4042-90f8-4094e6d01665': < ResourceRef 'Storage.7590d3ff-797a-4042-90f8-4094e6d01665', isValid: 'True' obj: 'None'>} Thread-430::DEBUG::2016-06-05 14:13:30,991::resourceManager::989::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Tested with the following code: ------------------------------------------ rhevm-4.0.0.4-0.1.el7ev.noarch vdsm-4.18.2-0.el7ev.x86_64 Tested with the following scenario: ------------------------------------- 1.Created Vm with 3 disks, 1 iscsi - preallocated, 1 iscsi - thin and 1 nfs - thin 2. Created snapshots sd1, sd2, sd3 and sd4 while VM is dowm 3. Deleted sd2 successfully 4. Deleted sd3 successfully 5. Started the VM 6. Created snapshot su1 7. Created snapshot su2 8. Created snapshot su3 9. Created snapshot su4 10. Deleted snapshot sd1 successfully 11. Deleted snapshot su2 successfully MOVING TO VERIFIED!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1743.html