Description of problem:
VM disks become illegal after a failed Live Merge

Version-Release number of selected component (if applicable):
3.5.1 vt14.3
rhevm-3.5.1-0.4.el6ev.noarch
vdsm-4.16.13.1-1.el7ev.x86_64

How reproducible:
Ran this once

Steps to Reproduce:
1. Create a VM with several disks, including 2 preallocated block disks and 1 thin-provisioned NFS disk
2. Start the VM
3. Create 3 snapshots: snsa1, snsa2, snsa3
4. Run a Live Storage Migration operation by selecting all of the VM's disks and moving them to new storage domains. While the disks are locked, block the storage from the host using iptables (an illustrative sketch of this blocking appears after the log excerpts below).
>>>>> The Live Storage Migration fails, as expected.
5. Unblock the storage on the host using iptables
6. Start a Delete Snapshot operation
>>>> The Live Snapshot Delete / Merge operation fails
7. Note: running a Live Snapshot Delete (Merge) operation after a successful Live Storage Migration operation works fine.

Actual results:
After a failed Live Storage Migration operation, a Live Snapshot Delete (Merge) operation fails.

Expected results:
After a failed Live Storage Migration operation, the system should roll back and all further operations on the storage should work.

Additional info:
NOTE: THERE IS A 3 HOUR TIME DIFFERENCE BETWEEN THE ENGINE AND THE HOST.

ENGINE log

LIVE MIGRATION STARTED HERE -----------------------------------------------------------------------------------------------

2015-04-21 17:30:12,800 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-29) Correlation ID: 76723cae, Job ID: 0638ac8a-3423-4373-af15-6649513080f7, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'Auto-generated for Live Storage Migration' creation for VM 'vm16' was initiated by admin@internal.

ERRORS REPORTED WHEN STORAGE IS BLOCKED ON HOST USING IPTABLES ------------------------------------

2015-04-21 17:32:54,585 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-6) [5a07c591] Failed in HSMGetAllTasksStatusesVDS method
2015-04-21 17:32:54,586 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-6) [5a07c591] Failed in HSMGetAllTasksStatusesVDS method
2015-04-21 17:32:54,591 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-6) [5a07c591] BaseAsyncTask::onTaskEndSuccess: Task 6a46d1cf-a49f-4d8e-9e58-f781b05043cd (Parent Command LiveMigrateVmDisks, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
2015-04-21 17:32:54,591 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler_Worker-6) [5a07c591] Task with DB Task ID 3b6dad14-fed8-40b3-9b62-9c11dbd1ecc7 and VDSM Task ID 7e5db0a8-4886-4dad-bddb-57f04950240f is in state Polling. End action for command ecc2e7c1-b82a-4a53-9042-8e624e5452f6 will proceed when all the entitys tasks are completed.
2015-04-21 17:32:54,591 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-6) [5a07c591] SPMAsyncTask::PollTask: Polling task 5ace41e8-feb4-43e0-8f47-5a5281001322 (Parent Command LiveMigrateVmDisks, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'cleanFailure'.
2015-04-21 17:32:54,596 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-6) [5a07c591] BaseAsyncTask::logEndTaskFailure: Task 5ace41e8-feb4-43e0-8f47-5a5281001322 (Parent Command LiveMigrateVmDisks, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanFailure
-- Message: VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Replace Logical Volume tag error, code = 572,
-- Exception: VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Replace Logical Volume tag error, code = 572
2015-04-21 17:32:54,596 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler_Worker-6) [5a07c591] Task with DB Task ID 3b6dad14-fed8-40b3-9b62-9c11dbd1ecc7 and VDSM Task ID 7e5db0a8-4886-4dad-bddb-57f04950240f is in state Polling. End action for command ecc2e7c1-b82a-4a53-9042-8e624e5452f6 will proceed when all the entitys tasks are completed.
2015-04-21 17:32:54,597 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-6) [5a07c591] SPMAsyncTask::PollTask: Polling task 7e5db0a8-4886-4dad-bddb-57f04950240f (Parent Command LiveMigrateVmDisks, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'cleanFailure'.
2015-04-21 17:32:54,601 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-6) [5a07c591] BaseAsyncTask::logEndTaskFailure: Task 7e5db0a8-4886-4dad-bddb-57f04950240f (Parent Command LiveMigrateVmDisks, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanFailure
-- Message: VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Replace Logical Volume tag error, code = 572,
-- Exception: VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Replace Logical Volume tag error, code = 572
2015-04-21 17:32:54,601 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler_Worker-6) [5a07c591] CommandAsyncTask::endActionIfNecessary: All tasks of command ecc2e7c1-b82a-4a53-9042-8e624e5452f6 has ended -> executing endAction
2015-04-21 17:32:54,602 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler_Worker-6) [5a07c591] CommandAsyncTask::endAction: Ending action for 3 tasks (command ID: ecc2e7c1-b82a-4a53-9042-8e624e5452f6): calling endAction .
2015-04-21 17:32:54,602 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-8) [5a07c591] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction LiveMigrateVmDisks, executionIndex: 0
2015-04-21 17:32:54,722 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateVmDisksCommand] (org.ovirt.thread.pool-7-thread-8) [5a07c591] Ending command with failure: org.ovirt.engine.core.bll.lsm.LiveMigrateVmDisksCommand
2015-04-21 17:32:54,743 ERROR [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (org.ovirt.thread.pool-7-thread-8) Ending command with failure: org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand
2015-04-21 17:32:54,756 ERROR [org.ovirt.engine.core.bll.CreateSnapshotCommand] (org.ovirt.thread.pool-7-thread-8) [20a1cafc] Ending command with failure: org.ovirt.engine.core.bll.CreateSnapshotCommand
2015-04-21 17:32:54,906 ERROR [org.ovirt.engine.core.bll.CreateSnapshotCommand] (org.ovirt.thread.pool-7-thread-8) [345b289d] Ending command with failure: org.ovirt.engine.core.bll.CreateSnapshotCommand
2015-04-21 17:32:54,908 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMRevertTaskVDSCommand] (org.ovirt.thread.pool-7-thread-8) [345b289d] START, SPMRevertTaskVDSCommand( storagePoolId = 6d96f52d-d791-4f66-83bd-2553ca0f3012, ignoreFailoverLimit = false, taskId = 6a46d1cf-a49f-4d8e-9e58-f781b05043cd), log id: 1ea88db9
2015-04-21 17:32:54,910 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMRevertTaskVDSCommand] (org.ovirt.thread.pool-7-thread-8) [345b289d] START, HSMRevertTaskVDSCommand(HostName = nott-vds1, HostId = eaba7988-e524-4c19-bd7b-8f9ba0bbc769, taskId=6a46d1cf-a49f-4d8e-9e58-f781b05043cd), log id: 39908cf2
2015-04-21 17:32:55,097 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMRevertTaskVDSCommand] (org.ovirt.thread.pool-7-thread-8) [345b289d] FINISH, HSMRevertTaskVDSCommand, log id: 39908cf2
2015-04-21 17:32:55,097 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMRevertTaskVDSCommand] (org.ovirt.thread.pool-7-thread-8) [345b289d] FINISH, SPMRevertTaskVDSCommand, log id: 1ea88db9
2015-04-21 17:32:55,223 ERROR [org.ovirt.engine.core.bll.CreateSnapshotCommand] (org.ovirt.thread.pool-7-thread-8) [69f56a1b] Ending command with failure: org.ovirt.engine.core.bll.CreateSnapshotCommand
2015-04-21 17:32:55,352 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-8) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Failed to complete snapshot 'Auto-generated for Live Storage Migration' creation for VM 'vm16'.
2015-04-21 17:32:55,357 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateVmDisksCommand] (org.ovirt.thread.pool-7-thread-8) Lock freed to object EngineLock [exclusiveLocks= key: 1c9b5c2e-23ea-480c-b67b-50b969faf021 value: DISK key: 1746d080-f1e5-4cfe-aeec-0eb5ffff92c3 value: DISK key: 1ef1327e-1e1d-489d-9db2-2f0c2537604b value: DISK , sharedLocks= key: 64b9a769-753a-4a51-9741-0f93f3b4f060 value: VM ]
2015-04-21 17:32:55,359 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-8) CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type LiveMigrateVmDisks completed, handling the result.
2015-04-21 17:32:55,359 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-8) CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type LiveMigrateVmDisks succeeded, clearing tasks.
STARTED LIVE SNAPSHOT DELETE A FEW MINUTES AFTER UNBLOCKING THE IPTABLES --------------------------------------------

2015-04-21 17:36:45,189 INFO [org.ovirt.engine.core.bll.RemoveMemoryVolumesCommand] (pool-6-thread-6) [48240de4] Running command: RemoveMemoryVolumesCommand internal: true.
2015-04-21 17:36:45,214 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-5) Correlation ID: 3d2dda02, Job ID: 8ec1ea49-4695-4d05-af0a-6c453dc5ac10, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'snapp2' deletion for VM 'vm16' was initiated by admin@internal.
2015-04-21 17:36:45,220 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-6) [48240de4] START, DeleteImageGroupVDSCommand( storagePoolId = 6d96f52d-d791-4f66-83bd-2553ca0f3012, ignoreFailoverLimit = false, storageDomainId = e2b109f3-d36b-4696-9ee9-99e4fb7d1fd5, imageGroupId = 4692515d-f8f6-4218-a237-b7d90dc96e71, postZeros = false, forceDelete = false), log id: 48ce0bee
2015-04-21 17:36:48,189 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-6) [48240de4] FINISH, DeleteImageGroupVDSCommand, log id: 48ce0bee
2015-04-21 17:36:48,195 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (pool-6-thread-6) [48240de4] CommandAsyncTask::Adding CommandMultiAsyncTasks object for command 143bd8c8-ac95-4b4c-9078-fa34292eaf18
2015-04-21 17:36:48,195 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (pool-6-thread-6) [48240de4] CommandMultiAsyncTasks::AttachTask: Attaching task 9458d2bd-f5ac-4c2a-a0c2-52ab3fb2f65d to command 143bd8c8-ac95-4b4c-9078-fa34292eaf18.

FAILURE OF LIVE SNAPSHOT DELETE -------------------------------------------------------------------------------------------------------------

2015-04-21 17:38:13,874 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-71) [9f60b4a] Snapshot 6b43e440-589f-4294-8732-08fb58a9f988 images e4e1d5af-65ef-4ed8-9aba-95705976c367..1b115362-9a30-43ec-b06f-b25fa0052508 merged, but volume removal failed. Some or all of the following volumes may be orphaned: [e4e1d5af-65ef-4ed8-9aba-95705976c367]. Please retry Live Merge on the snapshot to complete the operation.
2015-04-21 17:38:13,880 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (DefaultQuartzScheduler_Worker-71) [15504788] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-21 17:38:14,780 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-69) Failed in HSMGetAllTasksStatusesVDS method
2015-04-21 17:38:14,781 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-69) Failed in HSMGetAllTasksStatusesVDS method
2015-04-21 17:38:14,781 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-69) BaseAsyncTask::logEndTaskFailure: Task 4f6ea81e-e40e-4927-97b6-0a6a65a590a9 (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Could not acquire resource. Probably resource factory threw an exception.: (), code = 100,
-- Exception: VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Could not acquire resource. Probably resource factory threw an exception.: (), code = 100
2015-04-21 17:38:14,781 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler_Worker-69) CommandAsyncTask::endActionIfNecessary: All tasks of command 34e07226-09d0-4b67-892b-d904f3815efa has ended -> executing endAction
2015-04-21 17:38:14,781 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler_Worker-69) CommandAsyncTask::endAction: Ending action for 1 tasks (command ID: 34e07226-09d0-4b67-892b-d904f3815efa): calling endAction .
2015-04-21 17:38:14,781 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-3) CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction RemoveSnapshotSingleDiskLive, executionIndex: 0
2015-04-21 17:38:14,785 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-69) BaseAsyncTask::logEndTaskFailure: Task 69cafb68-360c-4ee3-9c1d-7a118d228c8e (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Could not acquire resource. Probably resource factory threw an exception.: (), code = 100,
-- Exception: VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Could not acquire resource. Probably resource factory threw an exception.: (), code = 100
2015-04-21 17:38:14,786 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler_Worker-69) CommandAsyncTask::endActionIfNecessary: All tasks of command 17b08dd0-24f0-493e-bb53-88db9a5f7817 has ended -> executing endAction
2015-04-21 17:38:14,786 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler_Worker-69) CommandAsyncTask::endAction: Ending action for 1 tasks (command ID: 17b08dd0-24f0-493e-bb53-88db9a5f7817): calling endAction .
2015-04-21 17:38:14,786 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-27) CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction RemoveSnapshotSingleDiskLive, executionIndex: 0
2015-04-21 17:38:14,804 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-3) [9f60b4a] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-21 17:38:14,814 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-3) [15504788] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-21 17:38:14,815 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-3) CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type RemoveSnapshotSingleDiskLive completed, handling the result.
2015-04-21 17:38:14,816 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-3) CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type RemoveSnapshotSingleDiskLive hasnt succeeded, not clearing tasks, will attempt again next polling.
2015-04-21 17:38:14,819 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-27) [7bb45238] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-21 17:38:14,823 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-27) [2f7596cf] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-21 17:38:14,824 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-27) CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type RemoveSnapshotSingleDiskLive completed, handling the result.
2015-04-21 17:38:14,824 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-27) CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type RemoveSnapshotSingleDiskLive hasnt succeeded, not clearing tasks, will attempt again next polling.
2015-04-21 17:38:23,891 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-87) [3d2dda02] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotCommand
2015-04-21 17:38:23,911 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-87) [3d2dda02] Correlation ID: 3d2dda02, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'snapp2' for VM 'vm16'.
2015-04-21 17:38:24,797 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-9) [232cf324] Failed in HSMGetAllTasksStatusesVDS method
2015-04-21 17:38:24,797 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-9) [232cf324] Failed in HSMGetAllTasksStatusesVDS method

NOTE: THERE IS A 3 HOUR DIFFERENCE BETWEEN THE ENGINE AND HOST LOGS!!!!!!!!
==================================================================================================================

VDSM log

Thread-118033::DEBUG::2015-04-21 20:30:34,925::task::595::Storage.TaskManager.Task::(_updateState) Task=`5ace41e8-feb4-43e0-8f47-5a5281001322`::moving from state init -> state preparing
Thread-116487::ERROR::2015-04-21 20:33:05,069::misc::210::Storage.Misc::(readspeed) Unable to read file '/dev/fd8aa916-0b31-4d2e-99ef-f218967e93e1/metadata'
Thread-116487::ERROR::2015-04-21 20:33:05,070::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain fd8aa916-0b31-4d2e-99ef-f218967e93e1 monitoring information
Traceback (most recent call last):
.
.
Thread-40::ERROR::2015-04-21 20:33:07,916::lvm::404::Storage.LVM::(_reloadvgs) vg 88eb14b4-a7fc-40ab-be6b-e2ab7bc31dbc has pv_count 3 but pv_names ('/dev/mapper/360060160f4a030002c5dc169973de411', '/dev/mapper/36006016012823000387933e01ab5e311')
.
.
Thread-40::ERROR::2015-04-21 20:33:11,271::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 88eb14b4-a7fc-40ab-be6b-e2ab7bc31dbc monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 232, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/blockSD.py", line 831, in selftest
    lvm.chkVG(self.sdUUID)
  File "/usr/share/vdsm/storage/lvm.py", line 985, in chkVG
    raise se.StorageDomainAccessError("%s: %s" % (vgName, err))
.
.
5ace41e8-feb4-43e0-8f47-5a5281001322::DEBUG::2015-04-21 20:33:11,705::lvm::514::Storage.OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex
5ace41e8-feb4-43e0-8f47-5a5281001322::DEBUG::2015-04-21 20:33:11,705::lvm::526::Storage.OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex
5ace41e8-feb4-43e0-8f47-5a5281001322::ERROR::2015-04-21 20:33:11,705::volume::473::Storage.Volume::(create) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/volume.py", line 435, in create
    srcVolUUID, volPath)
  File "/usr/share/vdsm/storage/blockVolume.py", line 159, in _create
    addTags=mdTags)
  File "/usr/share/vdsm/storage/lvm.py", line 1232, in changeLVTags
    (lvname, ", ".join(addTags), ", ".join(delTags), err[-1]))
LogicalVolumeReplaceTagError: Replace Logical Volume tag error: (u'lv: `fd8aa916-0b31-4d2e-99ef-f218967e93e1/63335b3b-6d82-446c-bb16-4ed9a6b20207` add: `PU_c2a1254b-58dd-4d4c-ac0d-275093c99d05, MD_10, IU_1c9b5c2e-23ea-480c-b67b-50b969faf021` del: `OVIRT_VOL_INITIALIZING` ( Cannot process volume group fd8aa916-0b31-4d2e-99ef-f218967e93e1)',)
.
.
5ace41e8-feb4-43e0-8f47-5a5281001322::DEBUG::2015-04-21 20:33:11,725::resourceManager::649::Storage.ResourceManager::(releaseResource) No one is waiting for resource 'fd8aa916-0b31-4d2e-99ef-f218967e93e1_imageNS.1c9b5c2e-23ea-480c-b67b-50b969faf021', Clearing records.
5ace41e8-feb4-43e0-8f47-5a5281001322::ERROR::2015-04-21 20:33:11,725::task::866::Storage.TaskManager.Task::(_setError) Task=`5ace41e8-feb4-43e0-8f47-5a5281001322`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 334, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1821, in createVolume
    desc=desc, srcImgUUID=srcImgUUID, srcVolUUID=srcVolUUID)
  File "/usr/share/vdsm/storage/sd.py", line 430, in createVolume
    preallocate, diskType, volUUID, desc, srcImgUUID, srcVolUUID)
  File "/usr/share/vdsm/storage/volume.py", line 435, in create
    srcVolUUID, volPath)
  File "/usr/share/vdsm/storage/blockVolume.py", line 159, in _create
    addTags=mdTags)
  File "/usr/share/vdsm/storage/lvm.py", line 1232, in changeLVTags
    (lvname, ", ".join(addTags), ", ".join(delTags), err[-1]))
.
.
5ace41e8-feb4-43e0-8f47-5a5281001322::DEBUG::2015-04-21 20:33:11,725::resourceManager::649::Storage.ResourceManager::(releaseResource) No one is waiting for resource 'fd8aa916-0b31-4d2e-99ef-f218967e93e1_imageNS.1c9b5c2e-23ea-480c-b67b-50b969faf021', Clearing records.
5ace41e8-feb4-43e0-8f47-5a5281001322::ERROR::2015-04-21 20:33:11,725::task::866::Storage.TaskManager.Task::(_setError) Task=`5ace41e8-feb4-43e0-8f47-5a5281001322`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 334, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1821, in createVolume
    desc=desc, srcImgUUID=srcImgUUID, srcVolUUID=srcVolUUID)
  File "/usr/share/vdsm/storage/sd.py", line 430, in createVolume
    preallocate, diskType, volUUID, desc, srcImgUUID, srcVolUUID)
  File "/usr/share/vdsm/storage/volume.py", line 435, in create
    srcVolUUID, volPath)
  File "/usr/share/vdsm/storage/blockVolume.py", line 159, in _create
    addTags=mdTags)
  File "/usr/share/vdsm/storage/lvm.py", line 1232, in changeLVTags
    (lvname, ", ".join(addTags), ", ".join(delTags), err[-1]))
.
.
5ace41e8-feb4-43e0-8f47-5a5281001322::DEBUG::2015-04-21 20:33:11,811::blockVolume::594::Storage.Misc.excCmd::(getMetadata) FAILED: <err> = "/usr/bin/dd: error reading '/dev/fd8aa916-0b31-4d2e-99ef-f218967e93e1/metadata': Input/output error\n0+0 records in\n0+0 records out\n0 bytes (0 B) copied, 0.000234782 s, 0.0 kB/s\n"; <rc> = 1
5ace41e8-feb4-43e0-8f47-5a5281001322::ERROR::2015-04-21 20:33:11,811::blockVolume::605::Storage.Volume::(getMetadata) Internal block device read failure: 'name=/dev/fd8aa916-0b31-4d2e-99ef-f218967e93e1/metadata, offset=6144, size=512'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/blockVolume.py", line 594, in getMetadata
    offs * VOLUME_METASIZE, VOLUME_METASIZE)
  File "/usr/share/vdsm/storage/misc.py", line 251, in readblock
    raise se.MiscBlockReadException(name, offset, size)
MiscBlockReadException: Internal block device read failure: 'name=/dev/fd8aa916-0b31-4d2e-99ef-f218967e93e1/metadata, offset=6144, size=512'
5ace41e8-feb4-43e0-8f47-5a5281001322::ERROR::2015-04-21 20:33:11,812::task::866::Storage.TaskManager.Task::(_setError) Task=`5ace41e8-feb4-43e0-8f47-5a5281001322`::Unexpected error
Traceback (most recent call last):
.
.
7e5db0a8-4886-4dad-bddb-57f04950240f::ERROR::2015-04-21 20:33:13,881::volume::473::Storage.Volume::(create) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/volume.py", line 435, in create
    srcVolUUID, volPath)
  File "/usr/share/vdsm/storage/blockVolume.py", line 159, in _create
    addTags=mdTags)
  File "/usr/share/vdsm/storage/lvm.py", line 1232, in changeLVTags
    (lvname, ", ".join(addTags), ", ".join(delTags), err[-1]))
.
.
7e5db0a8-4886-4dad-bddb-57f04950240f::ERROR::2015-04-21 20:33:13,956::blockVolume::605::Storage.Volume::(getMetadata) Internal block device read failure: 'name=/dev/fd8aa916-0b31-4d2e-99ef-f218967e93e1/metadata, offset=6656, size=512'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/blockVolume.py", line 594, in getMetadata
    offs * VOLUME_METASIZE, VOLUME_METASIZE)
  File "/usr/share/vdsm/storage/misc.py", line 251, in readblock
    raise se.MiscBlockReadException(name, offset, size)
MiscBlockReadException: Internal block device read failure: 'name=/dev/fd8aa916-0b31-4d2e-99ef-f218967e93e1/metadata, offset=6656, size=512'
7e5db0a8-4886-4dad-bddb-57f04950240f::ERROR::2015-04-21 20:33:13,957::task::866::Storage.TaskManager.Task::(_setError) Task=`7e5db0a8-4886-4dad-bddb-57f04950240f`::Unexpected error
Traceback (most recent call last):
.
.
Thread-116487::ERROR::2015-04-21 20:33:15,088::misc::210::Storage.Misc::(readspeed) Unable to read file '/dev/fd8aa916-0b31-4d2e-99ef-f218967e93e1/metadata'
Thread-116487::ERROR::2015-04-21 20:33:15,089::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain fd8aa916-0b31-4d2e-99ef-f218967e93e1 monitoring information
Traceback (most recent call last):
.
.
Thread-40::ERROR::2015-04-21 20:33:22,998::lvm::404::Storage.LVM::(_reloadvgs) vg 88eb14b4-a7fc-40ab-be6b-e2ab7bc31dbc has pv_count 3 but pv_names ('/dev/mapper/360060160f4a030002c5dc169973de411', '/dev/mapper/36006016012823000387933e01ab5e311')
Thread-40::DEBUG::2015-04-21 20:33:22,998::lvm::416::Storage.OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-40::ERROR::2015-04-21 20:33:22,999::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 88eb14b4-a7fc-40ab-be6b-e2ab7bc31dbc monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 232, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/blockSD.py", line 833, in selftest
    raise se.StorageDomainAccessError(self.sdUUID)
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('88eb14b4-a7fc-40ab-be6b-e2ab7bc31dbc',)
Dummy-102::DEBUG::2015-04-21 20:33:23,831::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/6d96f52d-d791-4f66-83bd-2553ca0f3012/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-102::DEBUG::2015-04-21 20:33:23,855::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.013281 s, 77.1 MB/s\n'; <rc> = 0
JsonRpc (StompReactor)::DEBUG::2015-04-21 20:33:24,240::stompReactor::98::Broker.StompAdapter::(handle_frame) Handling message <StompFrame command='SEND'>
JsonRpcServer::DEBUG::2015-04-21 20:33:24,241::__init__::506::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request
Thread-118153::DEBUG::2015-04-21 20:33:24,244::stompReactor::163::yajsonrpc.StompServer::(send) Sending response
Thread-116487::DEBUG::2015-04-21 20:33:25,092::blockSD::596::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd if=/dev/fd8aa916-0b31-4d2e-99ef-f218967e93e1/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-116487::DEBUG::2015-04-21 20:33:25,104::blockSD::596::Storage.Misc.excCmd::(getReadDelay) FAILED: <err> = "/usr/bin/dd: error reading '/dev/fd8aa916-0b31-4d2e-99ef-f218967e93e1/metadata': Input/output error\n0+0 records in\n0+0 records out\n0 bytes (0 B) copied, 0.000246275 s, 0.0 kB/s\n"; <rc> = 1
Thread-116487::ERROR::2015-04-21 20:33:25,105::misc::210::Storage.Misc::(readspeed) Unable to read file '/dev/fd8aa916-0b31-4d2e-99ef-f218967e93e1/metadata'
Thread-116487::ERROR::2015-04-21 20:33:25,105::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain fd8aa916-0b31-4d2e-99ef-f218967e93e1 monitoring information
Traceback (most recent call last):
Thread-118150::DEBUG::2015-04-21 20:33:18,052::taskManager::171::Storage.TaskManager::(clearTask) Entry. taskID: 5ace41e8-feb4-43e0-8f47-5a5281001322
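
For completeness on steps 4-5 above: the storage was cut off from the host with iptables. The sketch below only illustrates that kind of blocking; the address and helper names are placeholders and are not taken from this bug's environment.

import subprocess

STORAGE_ADDR = "192.0.2.10"  # placeholder address of the NFS server / iSCSI portal

def _rule(action):
    # action is "-A" (append) to start blocking, "-D" (delete) to unblock
    return ["iptables", action, "OUTPUT", "-d", STORAGE_ADDR, "-j", "DROP"]

def block_storage():
    subprocess.check_call(_rule("-A"))

def unblock_storage():
    subprocess.check_call(_rule("-D"))

if __name__ == "__main__":
    block_storage()    # step 4: LSM fails while this rule is active
    # ... wait for the Live Storage Migration to fail ...
    unblock_storage()  # step 5: restore access before running the delete snapshot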
Created attachment 1017453 [details]
engine, server, vdsm and libvirt logs

Adding logs. NOTE THAT THERE IS A 3 HOUR TIME DIFFERENCE BETWEEN THE ENGINE AND HOST LOGS.
Adam, can you take a look please?
Ok, I think I finally understand what's happening in this bug.

During live storage migration, we create an automatic snapshot to reduce the amount of data that the qemu process must synchronize:

Thread-118033::DEBUG::2015-04-21 20:30:34,918::__init__::469::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Volume.create' in bridge with {u'preallocate': 2, u'volFormat': 4, u'srcImgUUID': u'1c9b5c2e-23ea-480c-b67b-50b969faf021', u'volumeID': u'63335b3b-6d82-446c-bb16-4ed9a6b20207', u'imageID': u'1c9b5c2e-23ea-480c-b67b-50b969faf021', u'storagepoolID': u'6d96f52d-d791-4f66-83bd-2553ca0f3012', u'storagedomainID': u'fd8aa916-0b31-4d2e-99ef-f218967e93e1', u'desc': u'', u'diskType': 2, u'srcVolUUID': u'c2a1254b-58dd-4d4c-ac0d-275093c99d05', u'size': u'3221225472'}

Part of creating the snapshot volume is to mark the old leaf as internal. See blockVolume.BlockVolume._create():

    volParent.clone(volPath, volFormat)
    self.setInternal()

Some time later we experience a failure to access storage:

5ace41e8-feb4-43e0-8f47-5a5281001322::ERROR::2015-04-21 20:33:11,725::task::866::Storage.TaskManager.Task::(_setError) Task=`5ace41e8-feb4-43e0-8f47-5a5281001322`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 334, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1821, in createVolume
    desc=desc, srcImgUUID=srcImgUUID, srcVolUUID=srcVolUUID)
  File "/usr/share/vdsm/storage/sd.py", line 430, in createVolume
    preallocate, diskType, volUUID, desc, srcImgUUID, srcVolUUID)
  File "/usr/share/vdsm/storage/volume.py", line 435, in create
    srcVolUUID, volPath)
  File "/usr/share/vdsm/storage/blockVolume.py", line 159, in _create
    addTags=mdTags)
  File "/usr/share/vdsm/storage/lvm.py", line 1232, in changeLVTags
    (lvname, ", ".join(addTags), ", ".join(delTags), err[-1]))
LogicalVolumeReplaceTagError: Replace Logical Volume tag error: (u'lv: `fd8aa916-0b31-4d2e-99ef-f218967e93e1/63335b3b-6d82-446c-bb16-4ed9a6b20207` add: `PU_c2a1254b-58dd-4d4c-ac0d-275093c99d05, MD_10, IU_1c9b5c2e-23ea-480c-b67b-50b969faf021` del: `OVIRT_VOL_INITIALIZING` ( Cannot process volume group fd8aa916-0b31-4d2e-99ef-f218967e93e1)',)

This causes the createVolume task to fail and start recovery:

5ace41e8-feb4-43e0-8f47-5a5281001322::DEBUG::2015-04-21 20:33:11,725::task::885::Storage.TaskManager.Task::(_run) Task=`5ace41e8-feb4-43e0-8f47-5a5281001322`::Task._run: 5ace41e8-feb4-43e0-8f47-5a5281001322 () {} failed - stopping task

Part of recovery is supposed to fix the parent volume (making it a leaf again); see parentVolumeRollback(). But since storage cannot be accessed, recovery fails:

5ace41e8-feb4-43e0-8f47-5a5281001322::ERROR::2015-04-21 20:33:11,812::task::866::Storage.TaskManager.Task::(_setError) Task=`5ace41e8-feb4-43e0-8f47-5a5281001322`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 403, in run
    return function(self.task, *argslist)
  File "/usr/share/vdsm/storage/volume.py", line 317, in parentVolumeRollback
    pvol.prepare()
  File "/usr/share/vdsm/storage/volume.py", line 764, in prepare
    if not self.isLegal():
  File "/usr/share/vdsm/storage/volume.py", line 700, in isLegal
    legality = self.getMetaParam(LEGALITY)
  File "/usr/share/vdsm/storage/volume.py", line 918, in getMetaParam
    meta = self.getMetadata()
  File "/usr/share/vdsm/storage/blockVolume.py", line 606, in getMetadata
    raise se.VolumeMetadataReadError("%s: %s" % (metaId, e))
VolumeMetadataReadError: Error while processing volume meta data: ("('fd8aa916-0b31-4d2e-99ef-f218967e93e1', 12): Internal block device read failure: 'name=/dev/fd8aa916-0b31-4d2e-99ef-f218967e93e1/metadata, offset=6144, size=512'",)

The image is now left without a leaf volume in the chain. Later we attempt a live merge:

Thread-118301::DEBUG::2015-04-21 20:37:13,728::__init__::469::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'cf807726-3843-47e1-bf5b-ff0e81d51519', u'vmID': u'64b9a769-753a-4a51-9741-0f93f3b4f060', u'drive': {u'domainID': u'fd8aa916-0b31-4d2e-99ef-f218967e93e1', u'volumeID': u'c2a1254b-58dd-4d4c-ac0d-275093c99d05', u'poolID': u'6d96f52d-d791-4f66-83bd-2553ca0f3012', u'imageID': u'1c9b5c2e-23ea-480c-b67b-50b969faf021'}, u'bandwidth': u'0', u'jobUUID': u'76c42850-9bad-4bfc-8867-e33b921c9995', u'baseVolUUID': u'2c6abcc8-4a7f-4f7c-bdfa-c1d1430f7a56'}

The merge starts, libvirt finishes all of its work, and it's time to synchronize the new chain to storage; see LiveMergeCleanupThread and imageSyncVolumeChain. But this fails because we cannot lock the image:

Thread-118305::ERROR::2015-04-21 20:37:15,505::task::866::Storage.TaskManager.Task::(_setError) Task=`ade72a05-082e-462b-b3d8-d27127d57604`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1839, in imageSyncVolumeChain
    rm.LockType.shared):
  File "/usr/share/vdsm/storage/resourceManager.py", line 523, in acquireResource
    raise se.ResourceAcqusitionFailed()
ResourceAcqusitionFailed: Could not acquire resource. Probably resource factory threw an exception.: ()

Which happens because Image.getChain() cannot find the leaf volume:

Thread-118305::ERROR::2015-04-21 20:37:15,476::image::201::Storage.Image::(getChain) There is no leaf in the image 1c9b5c2e-23ea-480c-b67b-50b969faf021
Thread-118305::WARNING::2015-04-21 20:37:15,478::resourceManager::591::Storage.ResourceManager::(registerResource) Resource factory failed to create resource 'fd8aa916-0b31-4d2e-99ef-f218967e93e1_imageNS.1c9b5c2e-23ea-480c-b67b-50b969faf021'. Canceling request.
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/resourceManager.py", line 587, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/share/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/share/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/share/vdsm/storage/image.py", line 202, in getChain
    raise se.ImageIsNotLegalChain(imgUUID)
ImageIsNotLegalChain: Image is not a legal chain: (u'1c9b5c2e-23ea-480c-b67b-50b969faf021',)

The merge job ends without updating anything and the engine treats this as a live merge failure. Sadly, the merge actually did complete according to libvirt and qemu, but we are now out of sync.
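
To restate the failure mode compactly: the old leaf is demoted before the new volume exists, and the rollback that would re-promote it needs the very storage that just failed. The following is a purely illustrative model of that sequence (not VDSM code; all names are made up):

# Illustrative model only -- a legal chain has exactly one volume with voltype "LEAF".

def create_snapshot(chain, storage_ok):
    old_leaf = chain[-1]
    old_leaf["voltype"] = "INTERNAL"        # setInternal() on the parent happens first
    try:
        if not storage_ok:                  # e.g. the LV tag replace fails
            raise OSError("Replace Logical Volume tag error")
        chain.append({"name": "new-snap-vol", "voltype": "LEAF"})
    except OSError:
        # Recovery (parentVolumeRollback) must read the parent's metadata from
        # the same storage that just failed, so it fails as well.
        try:
            if not storage_ok:
                raise OSError("Internal block device read failure")
            old_leaf["voltype"] = "LEAF"    # rollback would restore the leaf
        except OSError:
            pass                            # chain is left with no LEAF at all

def is_legal_chain(chain):
    return sum(1 for vol in chain if vol["voltype"] == "LEAF") == 1

chain = [{"name": "base", "voltype": "INTERNAL"},
         {"name": "top", "voltype": "LEAF"}]
create_snapshot(chain, storage_ok=False)
print(is_legal_chain(chain))   # False -> later getChain() raises ImageIsNotLegalChain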
I'd like to note that using a garbage collection approach instead of rollbacks would resolve this problem since the bad state would be cleaned up as soon as storage comes back online. We will have this in place as part of my SPM removal work :) I'm not sure if there's a good way to fix this problem in 3.5.x. It would be far too invasive to rerun recoveries when they fail with exceptions.
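
As a rough illustration of the garbage-collection idea (hypothetical names only, not the actual SPM-removal design): cleanup is driven by markers left on storage rather than by a synchronous rollback, so nothing is lost if storage is down when the failure happens.

# Hypothetical sketch of a periodic garbage-collection pass; `domain` is any
# object exposing the methods used below. Not VDSM code.
INITIALIZING_TAG = "OVIRT_VOL_INITIALIZING"   # marker seen in the failed createVolume above

def gc_pass(domain):
    if not domain.is_accessible():
        return                    # try again next pass; the markers persist on storage
    for vol in domain.volumes():
        if INITIALIZING_TAG in vol.tags():
            parent = vol.parent()
            vol.remove()          # discard the half-created snapshot volume
            if parent is not None:
                parent.set_leaf() # undo the premature demotion of the old leaf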
The patch at http://gerrit.ovirt.org/45387 makes sure we handle errors when running imageSyncVolumeChain. This will correct some of the problematic behavior by not allowing the live merge job to resolve until imageSyncVolumeChain completes successfully. It will not prevent the image from being corrupted by the interrupted createVolume operation though.
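
A hedged sketch of the behavior described above (hypothetical names, not the code from the gerrit change): the merge job is only allowed to resolve once the chain sync succeeds, so a transient storage failure leaves the job pending instead of producing a bogus completion.

import logging
import time

def cleanup_after_merge(image, sync_volume_chain, retries=5, delay=10):
    """Do not mark the merge job finished until the volume chain on storage
    matches what libvirt/qemu actually did (e.g. via imageSyncVolumeChain)."""
    for attempt in range(1, retries + 1):
        try:
            sync_volume_chain(image)
            return True            # only now may the live merge job resolve
        except Exception:
            logging.exception("Chain sync for %s failed (attempt %d)", image, attempt)
            time.sleep(delay)
    return False                   # job stays unresolved; the engine will retry later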
As noted in comment 3, the issue here isn't LSM; it's the failure to take a snapshot. While the reported scenario indeed involves LSM, it may happen with any failed attempt to create a snapshot (regardless of its "liveness"), followed by a live merge. Setting priority to medium as this is an edge case of an edge case. Adam - could you please provide a manual procedure to recover from such a corruption?
There are several recovery actions which may or may not have been executed, so the only real way to recover from this is to examine the logs and the actual storage. For this very specific case the procedure is:

1. Identify the current leaf volume. It will be the same volume that was the leaf prior to creating the failed snapshot.
2. Modify the volume metadata on storage, changing VOLTYPE=INTERNAL to VOLTYPE=LEAF.

For attached and connected file-based domains, the metadata is located in /rhev/data-center/<POOL-ID>/<DOMAIN-ID>/images/<IMAGE-ID>/<VOLUME-ID>.meta (see the sketch below).

For block domains, I don't believe we have a tool to assist with modifying the metadata at this time.
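
For file-based domains, step 2 is a one-line change in the volume's .meta file. A cautious sketch (back up the file first and make sure no operation is running against the image; the path components are the placeholders from the procedure above):

import shutil

def mark_leaf(meta_path):
    """Flip VOLTYPE=INTERNAL to VOLTYPE=LEAF in a file-domain volume .meta file."""
    shutil.copy2(meta_path, meta_path + ".bak")      # keep a backup of the metadata
    with open(meta_path) as f:
        lines = f.read().splitlines()
    fixed = ["VOLTYPE=LEAF" if line.strip() == "VOLTYPE=INTERNAL" else line
             for line in lines]
    with open(meta_path, "w") as f:
        f.write("\n".join(fixed) + "\n")

# Example invocation (placeholders, not values from this bug):
# mark_leaf("/rhev/data-center/<POOL-ID>/<DOMAIN-ID>/images/<IMAGE-ID>/<VOLUME-ID>.meta")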
This is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted for next week, Nov 4th 2015. Please review this bug and, if it is not a blocker, please postpone it to a later release.

All bugs not postponed by the GA release will be automatically re-targeted to:
- 3.6.1 if severity >= high
- 4.0 if severity < high
Bug tickets must have version flags set prior to targeting them to a release. Please ask the maintainer to set the correct version flags and only then set the target milestone.
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Moving from 4.0 alpha to 4.0 beta, since 4.0 alpha has already been released and the bug is not ON_QA.
oVirt 4.0 beta has been released, moving to RC milestone.
Should this be on MODIFIED or ON_QA?
Neither; I still haven't investigated.
As explained in comments 3 and 7, this is a double (if not triple) negative flow, and we don't really have a good solution for it on top of the patches already provided.