Created attachment 951842 [details]
engine, vdsm and server logs

Description of problem:
On a live VM with 4 snapshots, the 2nd and 3rd snapshots were deleted successfully, but deleting the first snapshot created failed.

Version-Release number of selected component (if applicable):
rhevm-3.5.0-0.17.beta.el6ev.noarch
vdsm-4.16.7.2-1.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with 4 disks (2 thin provisioned and 2 preallocated NFS disks)
2. Create snapshots 1, 2, 3 and 4
3. Delete snapshot 3 - succeeds
4. Delete snapshot 2 - succeeds
5. Delete snapshot 4 - succeeds
6. Delete snapshot 1 - fails with 'Failed to MergeVDS'

Actual results:
Deletion of the first snapshot fails.

Expected results:
It should be possible to delete the first snapshot.

Additional info (from engine.log; the first entry is truncated):

3ff7d11, baseImageId=76005a6c-4c69-4c02-8523-29322dbd9e93, topImageId=dfbbb3c4-d893-47f9-b7e2-abefc3ff7d11, bandwidth=0}), log id: 15eb3238
2014-10-29 17:41:39,819 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-4) [7b8b187b] Failed in MergeVDS method
2014-10-29 17:41:39,821 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-4) [7b8b187b] Command org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand return value StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=52, mMessage=Merge failed]]
2014-10-29 17:41:39,821 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-4) [7b8b187b] HostName = nott-vds1
2014-10-29 17:41:39,822 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-4) [7b8b187b] Command MergeVDSCommand(HostName = nott-vds1, MergeVDSCommandParameters{HostId = edf26db5-6065-42dc-89c7-71936972b89c, vmId=878aeafd-0218-4ba0-b282-16a3d7bd3f1e, storagePoolId=b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd, storageDomainId=b473691a-e6f6-4c44-81cb-6bee354608fd, imageGroupId=19e7cce6-83f6-4919-befa-c80ee1a8013c, imageId=82df39aa-1921-450b-b3bf-86a9500e7dba, baseImageId=ced6bf4d-510c-4f65-8945-0e664a1ebee6, topImageId=82df39aa-1921-450b-b3bf-86a9500e7dba, bandwidth=0}) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52
2014-10-29 17:41:39,822 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-4) [7b8b187b] FINISH, MergeVDSCommand, log id: 41932952
2014-10-29 17:41:39,822 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-4) [7b8b187b] Command org.ovirt.engine.core.bll.MergeCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)
2014-10-29 17:41:39,831 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-4) [7b8b187b] Transaction rolled-back for command: org.ovirt.engine.core.bll.MergeCommand.
2014-10-29 17:41:39,955 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-9) [635edaea] Failed in MergeVDS method
2014-10-29 17:41:39,955 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-9) [635edaea] Command org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand return value StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=52, mMessage=Merge failed]]
2014-10-29 17:41:39,955 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-9) [635edaea] HostName = nott-vds1
2014-10-29 17:41:39,955 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-9) [635edaea] Command MergeVDSCommand(HostName = nott-vds1, MergeVDSCommandParameters{HostId = edf26db5-6065-42dc-89c7-71936972b89c, vmId=878aeafd-0218-4ba0-b282-16a3d7bd3f1e, storagePoolId=b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd, storageDomainId=b473691a-e6f6-4c44-81cb-6bee354608fd, imageGroupId=0e51e927-aefd-438c-a50a-b40f1517e511, imageId=0316d1b9-bcf2-4990-8d07-99dda9ba6afc, baseImageId=c9858f88-804d-4e76-999d-8747e7a7a593, topImageId=0316d1b9-bcf2-4990-8d07-99dda9ba6afc, bandwidth=0}) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52
2014-10-29 17:41:39,955 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-9) [635edaea] FINISH, MergeVDSCommand, log id: 40d8fa93
2014-10-29 17:41:39,955 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-9) [635edaea] Command org.ovirt.engine.core.bll.MergeCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)
2014-10-29 17:41:39,965 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-9) [635edaea] Transaction rolled-back for command: org.ovirt.engine.core.bll.MergeCommand.
2014-10-29 17:41:40,004 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-8) [5942e69a] Failed in MergeVDS method
2014-10-29 17:41:40,005 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-8) [5942e69a] Command org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand return value StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=52, mMessage=Merge failed]]
2014-10-29 17:41:40,005 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-8) [5942e69a] HostName = nott-vds1
2014-10-29 17:41:40,005 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-8) [5942e69a] Command MergeVDSCommand(HostName = nott-vds1, MergeVDSCommandParameters{HostId = edf26db5-6065-42dc-89c7-71936972b89c, vmId=878aeafd-0218-4ba0-b282-16a3d7bd3f1e, storagePoolId=b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd, storageDomainId=b473691a-e6f6-4c44-81cb-6bee354608fd, imageGroupId=058c45be-e419-4cdc-940d-511b2d6c597c, imageId=cb2cdf5d-4566-43ef-b0a3-25c786c2252f, baseImageId=f2f7a9dd-70aa-4230-b419-df92e8132f6c, topImageId=cb2cdf5d-4566-43ef-b0a3-25c786c2252f, bandwidth=0}) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52
2014-10-29 17:41:40,005 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-8) [5942e69a] FINISH, MergeVDSCommand, log id: 8f6655c
2014-10-29 17:41:40,005 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-8) [5942e69a] Command org.ovirt.engine.core.bll.MergeCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)
2014-10-29 17:41:40,015 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-8) [5942e69a] Transaction rolled-back for command: org.ovirt.engine.core.bll.MergeCommand.
2014-10-29 17:41:40,048 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-2) [1d225af9] Failed in MergeVDS method
2014-10-29 17:41:40,048 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-2) [1d225af9] Command org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand return value StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=52, mMessage=Merge failed]]
2014-10-29 17:41:40,048 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-2) [1d225af9] HostName = nott-vds1
2014-10-29 17:41:40,048 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-2) [1d225af9] Command MergeVDSCommand(HostName = nott-vds1, MergeVDSCommandParameters{HostId = edf26db5-6065-42dc-89c7-71936972b89c, vmId=878aeafd-0218-4ba0-b282-16a3d7bd3f1e, storagePoolId=b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd, storageDomainId=b473691a-e6f6-4c44-81cb-6bee354608fd, imageGroupId=ecef5ea8-d54f-443a-a76c-43c7241086ad, imageId=dfbbb3c4-d893-47f9-b7e2-abefc3ff7d11, baseImageId=76005a6c-4c69-4c02-8523-29322dbd9e93, topImageId=dfbbb3c4-d893-47f9-b7e2-abefc3ff7d11, bandwidth=0}) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52
2014-10-29 17:41:40,049 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-2) [1d225af9] FINISH, MergeVDSCommand, log id: 15eb3238
2014-10-29 17:41:40,049 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-2) [1d225af9] Command org.ovirt.engine.core.bll.MergeCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)
2014-10-29 17:41:40,063 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-2) [1d225af9] Transaction rolled-back for command: org.ovirt.engine.core.bll.MergeCommand.
2014-10-29 17:41:40,667 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-98) task id b6080ce4-9931-4889-81c7-45a66e9ebf16 is in pre-polling period and should not be polled. Pre-polling period is 60,000 millis.
2014-10-29 17:41:40,668 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-98) task id 3251241b-c19d-443c-9885-b1ef3c94b863 is in pre-polling period and should not be polled. Pre-polling period is 60,000 millis.
2014-10-29 17:41:59,175 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-32) [63bc0060] All Live Merge child commands have completed, status FAILED
2014-10-29 17:41:59,197 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-32) [5610a34e] Merging of snapshot 8c6fad2c-a592-417a-a7ee-d0a13028137d images ced6bf4d-510c-4f65-8945-0e664a1ebee6..82df39aa-1921-450b-b3bf-86a9500e7dba failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-29 17:41:59,235 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-32) [43419e49] Merging of snapshot 8c6fad2c-a592-417a-a7ee-d0a13028137d images c9858f88-804d-4e76-999d-8747e7a7a593..0316d1b9-bcf2-4990-8d07-99dda9ba6afc failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-29 17:41:59,259 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-32) [4f7f94db] Merging of snapshot 8c6fad2c-a592-417a-a7ee-d0a13028137d images f2f7a9dd-70aa-4230-b419-df92e8132f6c..cb2cdf5d-4566-43ef-b0a3-25c786c2252f failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-29 17:41:59,309 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-32) [227601da] Merging of snapshot 8c6fad2c-a592-417a-a7ee-d0a13028137d images 76005a6c-4c69-4c02-8523-29322dbd9e93..dfbbb3c4-d893-47f9-b7e2-abefc3ff7d11 failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-29 17:42:00,670 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-91) task id b6080ce4-9931-4889-81c7-45a66e9ebf16 is in pre-polling period and should not be polled. Pre-polling period is 60,000 millis.
2014-10-29 17:42:00,671 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-91) [1453b958] task id 3251241b-c19d-443c-9885-b1ef3c94b863 is in pre-polling period and should not be polled. Pre-polling period is 60,000 millis.
2014-10-29 17:42:09,318 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-96) [63bc0060] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotCommand
2014-10-29 17:42:09,382 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-96) [63bc0060] Correlation ID: 63bc0060, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot '1' for VM 'nfs1'.
2014-10-29 17:42:10,672 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-48) [1453b958] task id b6080ce4-9931-4889-81c7-45a66e9ebf16 is in pre-polling period and should not be polled. Pre-polling period is 60,000 millis.
2014-10-29 17:42:10,672 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-48) [1453b958] task id 3251241b-c19d-443c-9885-b1ef3c94b863 is in pre-polling period and should not be polled. Pre-polling period is 60,000 millis.
The Live Merge flow operates on the SPM host as well as on the host that is actually running the VM. If the VM is not running on the SPM host, I need to view the vdsm.log from two different hosts. It seems that you only attached the vdsm log from the SPM host, so I cannot even see the vdsm side of the error you are reporting. I'll need both logs in order to investigate.
Created attachment 952937 [details]
NEW engine, vdsm and server logs

Adding new logs. There was only one host, which was the SPM. I ran the above scenario again as follows:

Steps to Reproduce:
1. Create a VM with 4 disks (2 thin provisioned and 2 preallocated NFS disks)
2. Create snapshots 1, 2, 3 and 4
3. Delete snapshot 3 - succeeds
4. Delete snapshot 2 - succeeds
5. Delete snapshot 4 - FAILED
6. Deletion of snapshot 4 fails with 'Failed to MergeVDS'

From engine log:
------------------
2014-11-02 18:19:26,266 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-9) Correlation ID: 5f948463, Job ID: 590bfb8d-2c9f-4577-866d-826263e1d40e, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'sn4' deletion for VM 'nfs_2' was initiated by admin.
. . .
2014-11-02 18:20:16,466 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-80) [5f948463] All Live Merge child commands have completed, status FAILED
2014-11-02 18:20:26,472 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-8) [5f948463] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotCommand
2014-11-02 18:20:26,552 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-8) [5f948463] Correlation ID: 5f948463, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'sn4' for VM 'nfs_2'.
2014-11-02 18:21:22,664 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler_Worker-50) [191cf4f] Setting new tasks map. The map contains now 0 tasks
2014-11-02 18:21:22,664 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler_Worker-50) [191cf4f] Cleared all tasks of pool b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd.
Output from vdsm:
--------------------
grep 'Starting merge with jobUUID' /var/log/vdsm/vdsm.log

Thread-6478::INFO::2014-11-02 18:08:21,075::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='4aaf856c-768f-47aa-bc2c-3fc7643f9f5e'
Thread-6479::INFO::2014-11-02 18:08:21,617::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='1013eeec-0096-49ae-bf16-14e731eec477'
Thread-6480::INFO::2014-11-02 18:08:21,650::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='c8c7a767-6723-4430-8984-bf75bae54f0e'
Thread-6481::INFO::2014-11-02 18:08:21,688::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='1a565aef-28c6-4ddd-846d-c0e19020072e'
Thread-6679::INFO::2014-11-02 18:11:43,503::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='513773ca-6021-490e-9e0e-2029a1c15cfd'
Thread-6680::INFO::2014-11-02 18:11:43,838::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='33d30b57-83f9-4c38-8c17-c41eaf6ca5c4'
Thread-6682::INFO::2014-11-02 18:11:44,038::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='6a54f7ae-97ad-4385-bb1c-c6dd83c4dd29'
Thread-6681::INFO::2014-11-02 18:11:44,210::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='4bc99f3a-22b6-4437-b056-ae7a21290874'
Thread-7130::INFO::2014-11-02 18:19:55,895::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='d45ca947-dcd1-4bc8-8ac8-4c2f544262c2'
Thread-7131::INFO::2014-11-02 18:19:56,285::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='312ff5e6-45a5-4c23-b817-d161a087fc07'
Thread-7132::INFO::2014-11-02 18:19:56,455::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='121a0bd2-63bf-44e5-ad0e-4022aeda326b'
Thread-7133::INFO::2014-11-02 18:19:56,572::vm::5741::vm.Vm::(merge) vmId=`dea46acd-93c2-4974-934d-d0cddf1313e5`::Starting merge with jobUUID='0edcd414-4bd9-43db-8c96-ffea0a21c578'

Thread-22::ERROR::2014-11-02 18:09:16,886::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 1670633f-ed37-485e-98e8-d9c6f8687754
Thread-22::ERROR::2014-11-02 18:09:16,896::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 1670633f-ed37-485e-98e8-d9c6f8687754
Thread-22::ERROR::2014-11-02 18:14:18,178::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 1670633f-ed37-485e-98e8-d9c6f8687754
Thread-22::ERROR::2014-11-02 18:14:18,183::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 1670633f-ed37-485e-98e8-d9c6f8687754
Thread-22::ERROR::2014-11-02 18:19:19,589::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 1670633f-ed37-485e-98e8-d9c6f8687754
Thread-22::ERROR::2014-11-02 18:19:19,590::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 1670633f-ed37-485e-98e8-d9c6f8687754
8b3fd19c-64de-4edf-8a8a-e44a0a2ac458::ERROR::2014-11-02 18:20:16,446::task::866::Storage.TaskManager.Task::(_setError) Task=`8b3fd19c-64de-4edf-8a8a-e44a0a2ac458`::Unexpected error
After further investigation, it appears that the merge is failing in the final stages, when we are trying to delete the merged volumes. We have two SPM verbs, submitted by engine at around the same time, to delete two different volumes:

---
Thread-7161::DEBUG::2014-11-02 18:20:15,864::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Image.deleteVolumes' in bridge with {u'postZero': u'false', u'force': u'false', u'imageID': u'c93ffb9a-273e-4ee9-b1bc-a291e1699c15', u'volumeList': [u'f4ef865d-f0c8-4646-bdb6-535b379125a5'], u'storagepoolID': u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', u'storagedomainID': u'b473691a-e6f6-4c44-81cb-6bee354608fd'}
Thread-7161::DEBUG::2014-11-02 18:20:15,865::task::595::Storage.TaskManager.Task::(_updateState) Task=`2789d215-e4ee-443a-a941-28028794cf7a`::moving from state init -> state preparing
Thread-7163::DEBUG::2014-11-02 18:20:15,967::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Image.deleteVolumes' in bridge with {u'postZero': u'false', u'force': u'false', u'imageID': u'e1d8655f-e4ce-47b1-bb75-22fd270571f7', u'volumeList': [u'2850a36b-b0dd-4c16-a9e1-6aca7e03327f'], u'storagepoolID': u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', u'storagedomainID': u'b473691a-e6f6-4c44-81cb-6bee354608fd'}
Thread-7163::DEBUG::2014-11-02 18:20:15,975::task::595::Storage.TaskManager.Task::(_updateState) Task=`8b3fd19c-64de-4edf-8a8a-e44a0a2ac458`::moving from state init -> state preparing
---

This is fine, but something really strange happens. It seems that tasks 2789d215-e4ee-443a-a941-28028794cf7a and 8b3fd19c-64de-4edf-8a8a-e44a0a2ac458 both try to delete the same volume (f4ef865d-f0c8-4646-bdb6-535b379125a5) at the same time. I am starting to think that somehow one task clobbered the other.
Created attachment 953235 [details]
filtered vdsm.log showing only colliding tasks

This is a filtered vdsm.log showing the two colliding SPM tasks.
I think I found the real problem (it's not colliding tasks).

The function FileVolume.getChildren() builds a pattern and a file list, and then passes both to an OutOfProcess grep command. Herein lies a race condition. Since we have two deletes happening at nearly the same time, the file f4ef865d-f0c8-4646-bdb6-535b379125a5.meta gets picked up by the getChildren() call of task 8b3fd19c-64de-4edf-8a8a-e44a0a2ac458, which is trying to delete volume 2850a36b-b0dd-4c16-a9e1-6aca7e03327f. Meanwhile, task 2789d215-e4ee-443a-a941-28028794cf7a is deleting volume f4ef865d-f0c8-4646-bdb6-535b379125a5, so the metadata file is removed before the grep command has a chance to open it.

Some options for fixing this:
1. Require deleteVolumes to take an exclusive lock on the SD.
2. Change the logic of getChildren() to handle disappearing files (see the sketch after this list). I'd say the exclusive lock on the image would prevent any new children from being added during a delete.
3. Make a "gentlemen's agreement" with engine requiring live merge deletes to happen sequentially.

Federico, do you have a recommendation on how to fix this properly?
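For reference, option 2 could look like this minimal sketch, written as plain Python rather than vdsm's OutOfProcess grep machinery (the function body and the PUUID metadata key are illustrative assumptions, not the actual FileVolume code):

---
import errno
import glob
import os
import re

def get_children(img_dir, parent_uuid):
    """Find volumes whose metadata names parent_uuid as their parent,
    tolerating .meta files that vanish between listing and reading."""
    pattern = re.compile(r'^PUUID=%s$' % re.escape(parent_uuid), re.M)
    children = []
    for meta in glob.glob(os.path.join(img_dir, '*.meta')):
        try:
            with open(meta) as f:
                data = f.read()
        except IOError as e:
            # A concurrent deleteVolumes unlinked the file after we
            # listed it; a deleted volume is no longer a child, so
            # skipping it is safe.
            if e.errno == errno.ENOENT:
                continue
            raise
        if pattern.search(data):
            children.append(os.path.basename(meta)[:-len('.meta')])
    return children
---

The same principle would apply on the oop grep path: treat "No such file or directory" for an individual .meta file as "not a child" instead of failing the whole task.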
(In reply to Adam Litke from comment #5)
> Federico, do you have a recommendation on how to fix this properly?

This seems a duplicate of bug 1069610. I think that having a consistent exclusive lock on the image should be sufficient (it is actually needed anyway).
(In reply to Federico Simoncelli from comment #6)
> (In reply to Adam Litke from comment #5)
> > Federico, do you have a recommendation on how to fix this properly?
>
> This seems a duplicate of bug 1069610.

Yes, agreed.

> I think that having a consistent exclusive lock on the image should be
> sufficient (it is actually needed anyway).

But in this instance the missing file belongs to a different image, so if two deletes (on different images) are happening simultaneously, we'd still hit this race, wouldn't we?
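To make the concern concrete, here is a deterministic toy reproduction of the window (illustrative only, not vdsm code). The point is that per-image exclusive locks guard different images, so they never serialize the scanning task against the unlinking task:

---
import errno
import glob
import os
import tempfile

# "B" plays the task scanning for children, "A" the task deleting a
# volume that lives in a *different* image.
d = tempfile.mkdtemp()
meta = os.path.join(d, 'f4ef865d.meta')
open(meta, 'w').close()

listed = glob.glob(os.path.join(d, '*.meta'))  # B: build the file list
os.unlink(meta)                                # A: delete its volume's metadata
try:
    open(listed[0]).read()                     # B: read what it listed
except IOError as e:
    assert e.errno == errno.ENOENT
    print('race hit: %s vanished between list and read' % listed[0])
---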
*** This bug has been marked as a duplicate of bug 1069610 ***
Reopening since we want to fix this in 3.5 and the other issue from bug 1069610 was in 3.4.
I was able to delete the snapshots as per the above scenario, and live merge was successful each time. Performed this on VMs with NFS thin provisioned and preallocated disks.

Moving to Verified.
Verified with:
V3.6
ovirt-engine-3.6.0-0.0.master.20150412172306.git55ba764.el6.noarch
vdsm-4.17.0-632.git19a83a2.el7.x86_64

Verified using the following scenario:

Steps to Reproduce:
1. Create a VM with 4 disks (2 thin provisioned and 2 preallocated NFS disks)
2. Create snapshots 1, 2, 3 and 4
3. Delete snapshot 3 - succeeds
4. Delete snapshot 2 - succeeds
5. Delete snapshot 4 - succeeds
6. Delete snapshot 1 - succeeds

Moving to Verified!
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE