Bug 1158563 - Failed to Delete First NFS snapshot with live merge
Summary: Failed to Delete First NFS snapshot with live merge
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Adam Litke
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks: 1177220
 
Reported: 2014-10-29 16:20 UTC by Kevin Alon Goldblatt
Modified: 2016-03-10 12:04 UTC (History)
CC: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1177220 (view as bug list)
Environment:
Last Closed: 2014-11-10 15:25:35 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
engine vdsm and server logs (281.64 KB, application/x-gzip)
2014-10-29 16:20 UTC, Kevin Alon Goldblatt
NEW engine vdsm and server logs (662.55 KB, application/x-gzip)
2014-11-02 16:43 UTC, Kevin Alon Goldblatt
filtered vdsm.log showing only colliding tasks (87.91 KB, text/plain)
2014-11-03 19:20 UTC, Adam Litke


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1069610 0 high CLOSED Failed to remove a snapshot after restarting vdsm while creating this snapshot 2021-02-22 00:41:40 UTC
oVirt gerrit 39551 0 master MERGED spec: Update libvirt-daemon and libvirt-python requirements Never
oVirt gerrit 39670 0 ovirt-3.5 MERGED spec: Simplify libvirt dependencies Never
oVirt gerrit 39671 0 ovirt-3.5 MERGED spec: Update libvirt-daemon and libvirt-python requirements Never

Internal Links: 1069610

Description Kevin Alon Goldblatt 2014-10-29 16:20:11 UTC
Created attachment 951842 [details]
engine vdsm and server logs

Description of problem:
On a live VM with 4 snapshots, the 2nd and 3rd snapshots were deleted successfully, but the first snapshot created could not be deleted

Version-Release number of selected component (if applicable):
rhevm-3.5.0-0.17.beta.el6ev.noarch
vdsm-4.16.7.2-1.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Created a VM with 4 disks (2 thin-provisioned and 2 preallocated NFS disks)
2. Created snapshots 1, 2, 3 and 4
3. Deleted snapshot 3 successfully
4. Deleted snapshot 2 successfully
5. Deleted snapshot 4 successfully
6. Failed to delete snapshot 1 with 'Failed to MergeVDS'

Actual results:
Failed to delete the first snapshot

Expected results:
Should be able to delete the first snapshot

Additional info:
3ff7d11, baseImageId=76005a6c-4c69-4c02-8523-29322dbd9e93, topImageId=dfbbb3c4-d893-47f9-b7e2-abefc3ff7d11, bandwidth=0}), log id: 15eb3238
2014-10-29 17:41:39,819 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-4) [7b8b187b] Failed in MergeVDS method
2014-10-29 17:41:39,821 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-4) [7b8b187b] Command org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=52, mMessage=Merge failed]]
2014-10-29 17:41:39,821 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-4) [7b8b187b] HostName = nott-vds1
2014-10-29 17:41:39,822 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-4) [7b8b187b] Command MergeVDSCommand(HostName = nott-vds1, MergeVDSCommandParameters{HostId = edf26db5-6065-42dc-89c7-71936972b89c, vmId=878aeafd-0218-4ba0-b282-16a3d7bd3f1e, storagePoolId=b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd, storageDomainId=b473691a-e6f6-4c44-81cb-6bee354608fd, imageGroupId=19e7cce6-83f6-4919-befa-c80ee1a8013c, imageId=82df39aa-1921-450b-b3bf-86a9500e7dba, baseImageId=ced6bf4d-510c-4f65-8945-0e664a1ebee6, topImageId=82df39aa-1921-450b-b3bf-86a9500e7dba, bandwidth=0}) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52
2014-10-29 17:41:39,822 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-4) [7b8b187b] FINISH, MergeVDSCommand, log id: 41932952
2014-10-29 17:41:39,822 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-4) [7b8b187b] Command org.ovirt.engine.core.bll.MergeCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)
2014-10-29 17:41:39,831 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-4) [7b8b187b] Transaction rolled-back for command: org.ovirt.engine.core.bll.MergeCommand.
2014-10-29 17:41:39,955 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-9) [635edaea] Failed in MergeVDS method
2014-10-29 17:41:39,955 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-9) [635edaea] Command org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=52, mMessage=Merge failed]]
2014-10-29 17:41:39,955 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-9) [635edaea] HostName = nott-vds1
2014-10-29 17:41:39,955 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-9) [635edaea] Command MergeVDSCommand(HostName = nott-vds1, MergeVDSCommandParameters{HostId = edf26db5-6065-42dc-89c7-71936972b89c, vmId=878aeafd-0218-4ba0-b282-16a3d7bd3f1e, storagePoolId=b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd, storageDomainId=b473691a-e6f6-4c44-81cb-6bee354608fd, imageGroupId=0e51e927-aefd-438c-a50a-b40f1517e511, imageId=0316d1b9-bcf2-4990-8d07-99dda9ba6afc, baseImageId=c9858f88-804d-4e76-999d-8747e7a7a593, topImageId=0316d1b9-bcf2-4990-8d07-99dda9ba6afc, bandwidth=0}) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52
2014-10-29 17:41:39,955 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-9) [635edaea] FINISH, MergeVDSCommand, log id: 40d8fa93
2014-10-29 17:41:39,955 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-9) [635edaea] Command org.ovirt.engine.core.bll.MergeCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)
2014-10-29 17:41:39,965 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-9) [635edaea] Transaction rolled-back for command: org.ovirt.engine.core.bll.MergeCommand.
2014-10-29 17:41:40,004 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-8) [5942e69a] Failed in MergeVDS method
2014-10-29 17:41:40,005 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-8) [5942e69a] Command org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=52, mMessage=Merge failed]]
2014-10-29 17:41:40,005 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-8) [5942e69a] HostName = nott-vds1
2014-10-29 17:41:40,005 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-8) [5942e69a] Command MergeVDSCommand(HostName = nott-vds1, MergeVDSCommandParameters{HostId = edf26db5-6065-42dc-89c7-71936972b89c, vmId=878aeafd-0218-4ba0-b282-16a3d7bd3f1e, storagePoolId=b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd, storageDomainId=b473691a-e6f6-4c44-81cb-6bee354608fd, imageGroupId=058c45be-e419-4cdc-940d-511b2d6c597c, imageId=cb2cdf5d-4566-43ef-b0a3-25c786c2252f, baseImageId=f2f7a9dd-70aa-4230-b419-df92e8132f6c, topImageId=cb2cdf5d-4566-43ef-b0a3-25c786c2252f, bandwidth=0}) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52
2014-10-29 17:41:40,005 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-8) [5942e69a] FINISH, MergeVDSCommand, log id: 8f6655c
2014-10-29 17:41:40,005 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-8) [5942e69a] Command org.ovirt.engine.core.bll.MergeCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)
2014-10-29 17:41:40,015 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-8) [5942e69a] Transaction rolled-back for command: org.ovirt.engine.core.bll.MergeCommand.
2014-10-29 17:41:40,048 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-2) [1d225af9] Failed in MergeVDS method
2014-10-29 17:41:40,048 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-2) [1d225af9] Command org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=52, mMessage=Merge failed]]
2014-10-29 17:41:40,048 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-2) [1d225af9] HostName = nott-vds1
2014-10-29 17:41:40,048 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-2) [1d225af9] Command MergeVDSCommand(HostName = nott-vds1, MergeVDSCommandParameters{HostId = edf26db5-6065-42dc-89c7-71936972b89c, vmId=878aeafd-0218-4ba0-b282-16a3d7bd3f1e, storagePoolId=b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd, storageDomainId=b473691a-e6f6-4c44-81cb-6bee354608fd, imageGroupId=ecef5ea8-d54f-443a-a76c-43c7241086ad, imageId=dfbbb3c4-d893-47f9-b7e2-abefc3ff7d11, baseImageId=76005a6c-4c69-4c02-8523-29322dbd9e93, topImageId=dfbbb3c4-d893-47f9-b7e2-abefc3ff7d11, bandwidth=0}) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52
2014-10-29 17:41:40,049 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-2) [1d225af9] FINISH, MergeVDSCommand, log id: 15eb3238
2014-10-29 17:41:40,049 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-2) [1d225af9] Command org.ovirt.engine.core.bll.MergeCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)
2014-10-29 17:41:40,063 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-2) [1d225af9] Transaction rolled-back for command: org.ovirt.engine.core.bll.MergeCommand.
2014-10-29 17:41:40,667 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-98) task id b6080ce4-9931-4889-81c7-45a66e9ebf16 is in pre-polling  period and should not be polled. Pre-polling period is 60,000 millis. 
2014-10-29 17:41:40,668 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-98) task id 3251241b-c19d-443c-9885-b1ef3c94b863 is in pre-polling  period and should not be polled. Pre-polling period is 60,000 millis. 
2014-10-29 17:41:59,175 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-32) [63bc0060] All Live Merge child commands have completed, status FAILED
2014-10-29 17:41:59,197 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-32) [5610a34e] Merging of snapshot 8c6fad2c-a592-417a-a7ee-d0a13028137d images ced6bf4d-510c-4f65-8945-0e664a1ebee6..82df39aa-1921-450b-b3bf-86a9500e7dba failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-29 17:41:59,235 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-32) [43419e49] Merging of snapshot 8c6fad2c-a592-417a-a7ee-d0a13028137d images c9858f88-804d-4e76-999d-8747e7a7a593..0316d1b9-bcf2-4990-8d07-99dda9ba6afc failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-29 17:41:59,259 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-32) [4f7f94db] Merging of snapshot 8c6fad2c-a592-417a-a7ee-d0a13028137d images f2f7a9dd-70aa-4230-b419-df92e8132f6c..cb2cdf5d-4566-43ef-b0a3-25c786c2252f failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-29 17:41:59,309 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-32) [227601da] Merging of snapshot 8c6fad2c-a592-417a-a7ee-d0a13028137d images 76005a6c-4c69-4c02-8523-29322dbd9e93..dfbbb3c4-d893-47f9-b7e2-abefc3ff7d11 failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-29 17:42:00,670 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-91) task id b6080ce4-9931-4889-81c7-45a66e9ebf16 is in pre-polling  period and should not be polled. Pre-polling period is 60,000 millis. 
2014-10-29 17:42:00,671 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-91) task id 3251241b-c19d-443c-9885-b1ef3c94b863 is in pre-polling  period and should not be polled. Pre-polling period is 60,000 millis. 
2014-10-29 17:42:09,318 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-96) [63bc0060] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotCommand
2014-10-29 17:42:09,382 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-96) [63bc0060] Correlation ID: 63bc0060, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot '1' for VM 'nfs1'.
2014-10-29 17:42:10,672 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-48) [1453b958] task id b6080ce4-9931-4889-81c7-45a66e9ebf16 is in pre-polling  period and should not be polled. Pre-polling period is 60,000 millis. 
2014-10-29 17:42:10,672 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-48) [1453b958] task id 3251241b-c19d-443c-9885-b1ef3c94b863 is in pre-polling  period and should not be polled. Pre-polling period is 60,000 millis.

Comment 1 Adam Litke 2014-10-30 13:42:50 UTC
The Live Merge flows operate on the SPM host as well as the host that is actually running the VM.  If the VM is not running on the SPM host, I need to view the vdsm.log from two different hosts.  It seems that you only attached the vdsm.log from the SPM host, so I cannot even see the vdsm side of the error you are reporting.  I'll need both logs in order to investigate.

Comment 2 Kevin Alon Goldblatt 2014-11-02 16:43:53 UTC
Created attachment 952937 [details]
NEW engine vdsm and server logs

Added new logs. There was only one host, which was also the SPM.

I ran the above scenario again as follows:
Steps to Reproduce:
1. Created a VM with 4 disks (2 thin-provisioned and 2 preallocated NFS disks)
2. Created snapshots 1, 2, 3 and 4
3. Deleted snapshot 3 successfully
4. Deleted snapshot 2 successfully
5. Deleted snapshot 4 - FAILED
6. The delete of snapshot 4 failed with 'Failed to MergeVDS'

From engine log:
------------------
2014-11-02 18:19:26,266 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-9) Correlation ID: 5f948463, Job ID: 590bfb8d-2c9f-4577-866d-826263e1d40e, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'sn4' deletion for VM 'nfs_2' was initiated by admin.
.
.
.
2014-11-02 18:20:16,466 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-80) [5f948463] All Live Merge child commands have completed, status FAILED
2014-11-02 18:20:26,472 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-8) [5f948463] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotCommand
2014-11-02 18:20:26,552 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-8) [5f948463] Correlation ID: 5f948463, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'sn4' for VM 'nfs_2'.
2014-11-02 18:21:22,664 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler_Worker-50) [191cf4f] Setting new tasks map. The map contains now 0 tasks
2014-11-02 18:21:22,664 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler_Worker-50) [191cf4f] Cleared all tasks of pool b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd.



Output from the vdsm log:
--------------------
grep 'Starting merge with jobUUID' /var/log/vdsm/vdsm.log
Thread-6478::INFO::2014-11-02 18:08:21,075::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='4aaf856c-768f-47aa-bc2c-3fc7643f9f5e'
Thread-6479::INFO::2014-11-02 18:08:21,617::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='1013eeec-0096-49ae-bf16-14e731eec477'
Thread-6480::INFO::2014-11-02 18:08:21,650::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='c8c7a767-6723-4430-8984-bf75bae54f0e'
Thread-6481::INFO::2014-11-02 18:08:21,688::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='1a565aef-28c6-4ddd-846d-c0e19020072e'
Thread-6679::INFO::2014-11-02 18:11:43,503::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='513773ca-6021-490e-9e0e-2029a1c15cfd'
Thread-6680::INFO::2014-11-02 18:11:43,838::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='33d30b57-83f9-4c38-8c17-c41eaf6ca5c4'
Thread-6682::INFO::2014-11-02 18:11:44,038::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='6a54f7ae-97ad-4385-bb1c-c6dd83c4dd29'
Thread-6681::INFO::2014-11-02 18:11:44,210::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='4bc99f3a-22b6-4437-b056-ae7a21290874'
Thread-7130::INFO::2014-11-02 18:19:55,895::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='d45ca947-dcd1-4bc8-8ac8-4c2f544262c2'
Thread-7131::INFO::2014-11-02 18:19:56,285::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='312ff5e6-45a5-4c23-b817-d161a087fc07'
Thread-7132::INFO::2014-11-02 18:19:56,455::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='121a0bd2-63bf-44e5-ad0e-4022aeda326b'
Thread-7133::INFO::2014-11-02 18:19:56,572::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='0edcd414-4bd9-43db-8c96-ffea0a21c578'


Thread-22::ERROR::2014-11-02 18:09:16,886::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 1670633f-ed37-485e-98e8-d9c6f8687754
Thread-22::ERROR::2014-11-02 18:09:16,896::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 1670633f-ed37-485e-98e8-d9c6f8687754
Thread-22::ERROR::2014-11-02 18:14:18,178::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 1670633f-ed37-485e-98e8-d9c6f8687754
Thread-22::ERROR::2014-11-02 18:14:18,183::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 1670633f-ed37-485e-98e8-d9c6f8687754
Thread-22::ERROR::2014-11-02 18:19:19,589::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 1670633f-ed37-485e-98e8-d9c6f8687754
Thread-22::ERROR::2014-11-02 18:19:19,590::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 1670633f-ed37-485e-98e8-d9c6f8687754
8b3fd19c-64de-4edf-8a8a-e44a0a2ac458::ERROR::2014-11-02 18:20:16,446::task::866::Storage.TaskManager.Task::(_setError) Task=`8b3fd19c-64de-4edf-8a8a-e44a0a2ac458`::Unexpected error

Comment 3 Adam Litke 2014-11-03 19:17:03 UTC
After further investigation, it appears that the merge is failing in the final stages when we are trying to delete the merged volumes.

We have two SPM verbs, submitted by the engine at around the same time, to delete two different volumes:
---
Thread-7161::DEBUG::2014-11-02 18:20:15,864::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Image.deleteVolumes' in bridge with {u'postZero': u'false', u'force': u'false', u'imageID': u'c93ffb9a-273e-4ee9-b1bc-a291e1699c15', u'volumeList': [u'f4ef865d-f0c8-4646-bdb6-535b379125a5'], u'storagepoolID': u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', u'storagedomainID': u'b473691a-e6f6-4c44-81cb-6bee354608fd'}
Thread-7161::DEBUG::2014-11-02 18:20:15,865::task::595::Storage.TaskManager.Task::(_updateState) Task=`2789d215-e4ee-443a-a941-28028794cf7a`::moving from state init -> state preparing

Thread-7163::DEBUG::2014-11-02 18:20:15,967::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Image.deleteVolumes' in bridge with {u'postZero': u'false', u'force': u'false', u'imageID': u'e1d8655f-e4ce-47b1-bb75-22fd270571f7', u'volumeList': [u'2850a36b-b0dd-4c16-a9e1-6aca7e03327f'], u'storagepoolID': u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', u'storagedomainID': u'b473691a-e6f6-4c44-81cb-6bee354608fd'}
Thread-7163::DEBUG::2014-11-02 18:20:15,975::task::595::Storage.TaskManager.Task::(_updateState) Task=`8b3fd19c-64de-4edf-8a8a-e44a0a2ac458`::moving from state init -> state preparing
---

This is fine, but something really strange happens.  It seems that tasks 2789d215-e4ee-443a-a941-28028794cf7a and 8b3fd19c-64de-4edf-8a8a-e44a0a2ac458 both try to delete the same volume (f4ef865d-f0c8-4646-bdb6-535b379125a5) at the same time.  I am starting to think that somehow one task clobbered the other.
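
To narrow this down, here is a rough, hypothetical helper (an assumption-laden sketch, not part of vdsm: the Thread-NNNN:: prefix and Task=`<uuid>` markers are taken from the log excerpts above, and the function name tasks_touching is made up) that groups vdsm.log lines mentioning a given volume UUID by the SPM task that was active on each thread, which makes a collision like the one described above easy to spot:
---
# hypothetical log-correlation helper, not a vdsm tool
import re
import sys
from collections import defaultdict

TASK_RE = re.compile(r"Task=`([0-9a-f-]{36})`")

def tasks_touching(log_path, volume_uuid):
    """Map SPM task id -> lines on that task's thread mentioning the volume."""
    last_task = {}              # thread name -> most recent task id seen on it
    hits = defaultdict(list)
    with open(log_path) as log:
        for line in log:
            thread = line.split("::", 1)[0]
            match = TASK_RE.search(line)
            if match:
                last_task[thread] = match.group(1)
            if volume_uuid in line and thread in last_task:
                hits[last_task[thread]].append(line.rstrip())
    return hits

if __name__ == "__main__":
    # usage: python tasks_touching.py vdsm.log f4ef865d-f0c8-4646-bdb6-535b379125a5
    for task, lines in tasks_touching(sys.argv[1], sys.argv[2]).items():
        print("%s: %d line(s)" % (task, len(lines)))
---
If more than one task id shows up for the same volume, the two deletes really did overlap on it.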

Comment 4 Adam Litke 2014-11-03 19:20:25 UTC
Created attachment 953235 [details]
filtered vdsm.log showing only colliding tasks

This is a filtered vdsm.log showing the two colliding SPM tasks.

Comment 5 Adam Litke 2014-11-06 21:37:56 UTC
I think I found the real problem (it's not colliding tasks).

The function FileVolume::getChildren() builds a pattern and a file list and then passes both to an OutOfProcess grep command.  Herein lies a race condition.  Since we have two deletes happening at nearly the same time, the file f4ef865d-f0c8-4646-bdb6-535b379125a5.meta gets picked up by the getChildren() call of task 8b3fd19c-64de-4edf-8a8a-e44a0a2ac458, which is trying to delete volume 2850a36b-b0dd-4c16-a9e1-6aca7e03327f.  Meanwhile, task 2789d215-e4ee-443a-a941-28028794cf7a is deleting volume f4ef865d-f0c8-4646-bdb6-535b379125a5.  The metadata file is removed before the grep command has a chance to open it.

A few possible fixes:
1. Require deleteVolumes to take an exclusive lock on the SD.
2. Change the logic of getChildren() to handle disappearing files (see the sketch below).  I'd say the exclusive lock on the image would prevent any new children from being added during a delete.
3. Make a "gentlemen's agreement" with the engine requiring live merge deletes to happen sequentially.
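
A minimal sketch of option 2, assuming only what is described above (the image directory of *.meta files and a PUUID= line naming the parent volume are assumptions about the NFS layout, and find_children is an illustrative name, not vdsm's actual code): scan the image directory in-process and simply skip metadata files that vanish between listing and reading:
---
# hypothetical sketch of option 2 -- not vdsm's actual implementation
import errno
import glob
import os

def find_children(image_dir, parent_uuid):
    """Return UUIDs of volumes whose metadata names parent_uuid as their
    parent, tolerating metadata files removed by a concurrent delete."""
    children = []
    for meta_path in glob.glob(os.path.join(image_dir, "*.meta")):
        try:
            with open(meta_path) as f:
                metadata = f.read()
        except (IOError, OSError) as err:
            if err.errno == errno.ENOENT:
                # Another deleteVolumes task removed this volume between
                # listing the directory and opening the file; skip it.
                continue
            raise
        # Assumption: the parent volume UUID is stored as a PUUID= key.
        if "PUUID=%s" % parent_uuid in metadata:
            children.append(os.path.basename(meta_path)[:-len(".meta")])
    return children
---
The same "treat ENOENT as 'not a child'" idea applies if the grep stays out-of-process; the point is that a missing file must not fail the whole getChildren() call.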

Federico, do you have a recommendation on how to fix this properly?

Comment 6 Federico Simoncelli 2014-11-10 15:10:23 UTC
(In reply to Adam Litke from comment #5)
> Federico, do you have a recommendation on how to fix this properly?

This seems to be a duplicate of bug 1069610.

I think that having a consistent exclusive lock on the image should be sufficient (it is actually needed anyway).

Comment 7 Adam Litke 2014-11-10 15:22:40 UTC
(In reply to Federico Simoncelli from comment #6)
> (In reply to Adam Litke from comment #5)
> > Federico, do you have a recommendation on how to fix this properly?
> 
> This seems a duplicate of bug 1069610.

Yes, agree.

> I think that having a consistent exclusive lock on the image should be
> sufficient (it is actually needed anyway).

But in this instance, the missing file belongs to a different image, so if two deletes (on different images) are happening simultaneously, we'd still hit this race, wouldn't we?

Comment 8 Adam Litke 2014-11-10 15:25:35 UTC

*** This bug has been marked as a duplicate of bug 1069610 ***

Comment 9 Adam Litke 2014-11-25 21:40:38 UTC
Reopening, since we want to fix this in 3.5 and the other issue from bug 1069610 was in 3.4.

Comment 10 Kevin Alon Goldblatt 2014-12-10 22:20:43 UTC
I was able to delete the snapshots as per the above scenario successfully, and live merge was successful each time. Performed this on VMs with thin-provisioned and preallocated NFS disks. Moving to Verified.

Comment 14 Kevin Alon Goldblatt 2015-04-28 10:44:14 UTC
Verified with:
V3.6
ovirt-engine-3.6.0-0.0.master.20150412172306.git55ba764.el6.noarch
vdsm-4.17.0-632.git19a83a2.el7.x86_64

Verified using the following scenario:
Steps to Reproduce:
1. Created a VM with 4 disks (2 thin-provisioned and 2 preallocated NFS disks)
2. Created snapshots 1, 2, 3 and 4
3. Deleted snapshot 3 successfully
4. Deleted snapshot 2 successfully
5. Deleted snapshot 4 successfully
6. Deleted snapshot 1 successfully

Moving to verified!

Comment 15 Allon Mureinik 2016-03-10 10:43:08 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE

Comment 16 Allon Mureinik 2016-03-10 10:47:32 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE

Comment 17 Allon Mureinik 2016-03-10 12:04:03 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE

