Description of problem:
VM disks become illegal after a failed Live Merge.

Version-Release number of selected component (if applicable):
ver 3.5.1 vt14.3
rhevm-3.5.1-0.4.el6ev.noarch
vdsm-4.16.13.1-1.el7ev.x86_64

How reproducible:

Steps to Reproduce:
1. Create a VM with several disks, including block preallocated, block thin, NFS preallocated and NFS thin.
2. Start the VM.
3. Create 3 snapshots: snsa1, snsa2, snsa3.
4. Delete snapshot snsa2 - while the snapshot is locked, restart vdsm.
>>>>> The snapshot delete operation failed, BUT 2 of the block disks (vm11_Disk3 - preallocated and vm11_Disk3 - Thin) are now illegal. All further attempts to delete the snapshot fail.

Actual results:
2 of the block disks (vm11_Disk3 - preallocated and vm11_Disk3 - Thin) are illegal after the failed Live Snapshot Delete merge operation.

Expected results:
All disks should be in a legal state after a failed Live Snapshot Delete merge operation.

Additional info:
THERE IS A 3 HOUR TIME DIFFERENCE BETWEEN THE ENGINE AND THE HOST

ENGINE log
-----------------------
2015-04-19 16:00:13,498 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-8) Correlation ID: 4684f0de, Job ID: 2c2cac3f-6ce6-4c54-b86c-24d910e519bb, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'snsa2' deletion for VM 'vm_11' was initiated by admin@internal.
2015-04-19 16:01:21,236 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) Unable to process messages
2015-04-19 16:04:13,488 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-69) [1d8798ed] Command SpmStatusVDSCommand(HostName = nott-vds1, HostId = eaba7988-e524-4c19-bd7b-8f9ba0bbc769, storagePoolId = 6d96f52d-d791-4f66-83bd-2553ca0f3012) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2015-04-19 16:04:13,625 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand] (DefaultQuartzScheduler_Worker-69) [6497d7ed] Failed in GetStoragePoolInfoVDS method
2015-04-19 16:04:13,626 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-69) [6497d7ed] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to GetStoragePoolInfoVDS, error = Unknown pool id, pool not connected: (u'6d96f52d-d791-4f66-83bd-2553ca0f3012',), code = 309
2015-04-19 16:04:13,639 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (pool-6-thread-8) [4033c4ff] Failed in DestroyImageVDS method
2015-04-19 16:04:13,639 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-8) [4033c4ff] IrsBroker::Failed::DestroyImageVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to DestroyImageVDS, error = Unknown pool id, pool not connected: (u'6d96f52d-d791-4f66-83bd-2553ca0f3012',), code = 309
2015-04-19 16:04:13,648 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (pool-6-thread-8) [4033c4ff] Command org.ovirt.engine.core.bll.DestroyImageCommand throw Vdc Bll exception.
With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSErrorException: IRSGenericException: IRSErrorException: Failed to DestroyImageVDS, error = Unknown pool id, pool not connected: (u'6d96f52d-d791-4f66-83bd-2553ca0f3012',), code = 309 (Failed with error StoragePoolUnknown and code 309)
2015-04-19 16:04:13,672 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (pool-6-thread-8) [4033c4ff] Transaction rolled-back for command: org.ovirt.engine.core.bll.DestroyImageCommand.
2015-04-19 16:04:13,776 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-85) [5656cb99] Command HSMGetAllTasksStatusesVDSCommand(HostName = nott-vds1, HostId = eaba7988-e524-4c19-bd7b-8f9ba0bbc769) execution failed. Exception: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM: ()
2015-04-19 16:04:13,987 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-96) [6925c3dc] Command ListVDSCommand(HostName = nott-vds1, HostId = eaba7988-e524-4c19-bd7b-8f9ba0bbc769, vds=Host[nott-vds1,eaba7988-e524-4c19-bd7b-8f9ba0bbc769]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2015-04-19 16:04:14,003 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-96) [6925c3dc] Failure to refresh Vds runtime info: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2015-04-19 16:04:21,016 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-89) [7d2f99b] Unknown child command status for step DESTROY_IMAGE
2015-04-19 16:04:23,690 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-16) [7b969bf1] Command HSMGetAllTasksStatusesVDSCommand(HostName = nott-vds1, HostId = eaba7988-e524-4c19-bd7b-8f9ba0bbc769) execution failed.
Exception: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM
2015-04-19 16:04:28,536 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-35) [18ed9a86] BaseAsyncTask::logEndTaskFailure: Task 87012cd5-4bdf-430f-a514-9d1ef3550adb (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:28,544 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-8) [723a469e] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:28,546 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-35) [18ed9a86] BaseAsyncTask::logEndTaskFailure: Task 7da1ccb7-3440-4ebc-ab9e-fe4712e1493f (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:28,551 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-23) [624cd6d5] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:28,574 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-8) [2281f762] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:28,584 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-23) [4a68a89f] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:31,050 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [723a469e] Failed child command status for step DESTROY_IMAGE
2015-04-19 16:04:31,854 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [7d2f99b] Snapshot 314e7dbc-9784-43b4-907c-7fda03e3463a images dc6ac4be-a71d-4822-bd79-8522abdee74b..16565bda-e777-4362-9bfb-81fea9ad0ca3 merged, but volume removal failed. Some or all of the following volumes may be orphaned: [dc6ac4be-a71d-4822-bd79-8522abdee74b]. Please retry Live Merge on the snapshot to complete the operation.
2015-04-19 16:04:31,906 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [624cd6d5] Failed child command status for step DESTROY_IMAGE
2015-04-19 16:04:38,635 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-21) [6740fd3e] BaseAsyncTask::logEndTaskFailure: Task 87012cd5-4bdf-430f-a514-9d1ef3550adb (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:38,636 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-21) [6740fd3e] BaseAsyncTask::logEndTaskFailure: Task 7da1ccb7-3440-4ebc-ab9e-fe4712e1493f (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:38,651 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-5) [723a469e] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:38,654 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-39) [624cd6d5] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:38,669 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-5) [2281f762] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:38,670 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-39) [4a68a89f] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:42,851 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-75) [723a469e] Snapshot 314e7dbc-9784-43b4-907c-7fda03e3463a images 69c6b3b2-38f7-4b2f-9170-b9f1f46a796d..09e9930a-41b9-4eff-a31c-5b7b590bf63b merged, but volume removal failed. Some or all of the following volumes may be orphaned: [69c6b3b2-38f7-4b2f-9170-b9f1f46a796d]. Please retry Live Merge on the snapshot to complete the operation.
2015-04-19 16:04:42,862 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (DefaultQuartzScheduler_Worker-75) [2281f762] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:44,288 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-75) [4684f0de] All Live Merge child commands have completed, status FAILED
2015-04-19 16:04:46,002 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-75) [624cd6d5] Snapshot 314e7dbc-9784-43b4-907c-7fda03e3463a images 9f9f9f3c-37ac-4af6-9090-b8121f7028b0..81a8716d-713c-42f8-aa9c-7bd1692cd4ed merged, but volume removal failed. Some or all of the following volumes may be orphaned: [9f9f9f3c-37ac-4af6-9090-b8121f7028b0]. Please retry Live Merge on the snapshot to complete the operation.
2015-04-19 16:04:46,006 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (DefaultQuartzScheduler_Worker-75) [4a68a89f] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:48,864 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-43) BaseAsyncTask::logEndTaskFailure: Task 87012cd5-4bdf-430f-a514-9d1ef3550adb (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:48,864 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-43) BaseAsyncTask::logEndTaskFailure: Task 7da1ccb7-3440-4ebc-ab9e-fe4712e1493f (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:48,876 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-25) [624cd6d5] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:48,887 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-33) [723a469e] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:48,892 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-25) [4a68a89f] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:48,894 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-33) [2281f762] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:56,019 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-57) [4684f0de] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotCommand
2015-04-19 16:04:56,039 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-57) [4684f0de] Correlation ID: 4684f0de, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'snsa2' for VM 'vm_11'.
2015-04-19 16:04:59,313 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-17) [56cbba47] BaseAsyncTask::logEndTaskFailure: Task 87012cd5-4bdf-430f-a514-9d1ef3550adb (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:59,313 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-17) [56cbba47] BaseAsyncTask::logEndTaskFailure: Task 7da1ccb7-3440-4ebc-ab9e-fe4712e1493f (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:59,320 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-22) [624cd6d5] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:59,339 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-12) [723a469e] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand

VDSM log
-----------------------------
cbfe5fbe-9f6a-441f-bb5b-4bac3de41fdc::INFO::2015-04-19 19:01:20,332::volume::617::Storage.Volume::(setLegality) sdUUID=17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70 imgUUID=4a31b260-8136-411d-9743-b3e937859097 volUUID = 54c94300-650a-4da9-9656-87969a6c60a4 legality = ILLEGAL
0bb73a86-25be-42d5-9695-3b6bf1411d5b::INFO::2015-04-19 19:01:24,958::volume::617::Storage.Volume::(setLegality) sdUUID=3eae5544-0c43-483b-bf06-118660fcbfb1 imgUUID=32d03b22-d9a8-40e8-a7f6-fae12cdbb840 volUUID = 2655ffd3-5c64-4cd4-8aac-16599acae719 legality = ILLEGAL
e4ece9d5-2b01-4fda-b1af-2a6e0a766bed::INFO::2015-04-19 19:01:30,522::volume::617::Storage.Volume::(setLegality) sdUUID=17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70 imgUUID=569d7156-a956-4e56-afe2-a744bcf07260 volUUID = 43c81e12-7987-4d5d-88a7-8e5dae69fb3c legality = ILLEGAL
87012cd5-4bdf-430f-a514-9d1ef3550adb::INFO::2015-04-19 19:01:32,926::volume::617::Storage.Volume::(setLegality) sdUUID=3eae5544-0c43-483b-bf06-118660fcbfb1 imgUUID=de1e8749-c8df-45ba-b0a9-c40331f8d8ba volUUID = 69c6b3b2-38f7-4b2f-9170-b9f1f46a796d legality = ILLEGAL
7da1ccb7-3440-4ebc-ab9e-fe4712e1493f::INFO::2015-04-19 19:01:33,175::volume::617::Storage.Volume::(setLegality) sdUUID=3c9387ca-e89f-48c5-9dba-824639e604af imgUUID=36a57edb-ea1c-40d4-be76-8e4018513042 volUUID = 9f9f9f3c-37ac-4af6-9090-b8121f7028b0 legality = ILLEGAL
0e13510c-2932-4ad5-b2d8-f0c0daa379b2::INFO::2015-04-19 19:15:16,359::volume::617::Storage.Volume::(setLegality) sdUUID=6210d179-81a4-4fd0-bb09-350238520812 imgUUID=12ec201c-f95d-4909-a9a3-b75688d219c3 volUUID = dc6ac4be-a71d-4822-bd79-8522abdee74b legality = ILLEGAL
ioprocess communication (17553)::ERROR::2015-04-19 19:01:30,993::__init__::152::IOProcessClient::(_communicate) IOProcess failure
ioprocess communication (23042)::ERROR::2015-04-19 19:01:30,995::__init__::152::IOProcessClient::(_communicate) IOProcess failure
ioprocess communication (22967)::ERROR::2015-04-19 19:01:30,994::__init__::152::IOProcessClient::(_communicate) IOProcess failure
ioprocess communication (20822)::ERROR::2015-04-19 19:01:30,993::__init__::152::IOProcessClient::(_communicate) IOProcess failure
ioprocess communication (21250)::ERROR::2015-04-19 19:01:30,995::__init__::152::IOProcessClient::(_communicate) IOProcess failure
Thread-44::ERROR::2015-04-19 19:01:36,904::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 27d6175c-6a75-4eb6-910f-e9a158e63121 monitoring information
storageRefresh::ERROR::2015-04-19 19:01:47,635::lvm::336::Storage.LVM::(_reloadpvs) Missing pv: unRCWJ-idrB-HHdD-9iXo-stpp-Yqyt-0Xrpwv in vg: 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40
storageRefresh::ERROR::2015-04-19 19:01:47,635::lvm::336::Storage.LVM::(_reloadpvs) Missing pv: 3cEMHe-TWUD-iZJ2-pCYo-rEn5-EgSa-DMeSRZ in vg: 4bc98a03-c7c7-4701-b055-94f777670509
storageRefresh::ERROR::2015-04-19 19:01:47,635::lvm::336::Storage.LVM::(_reloadpvs) Missing pv: cP95qE-mfrM-qHgQ-OAH8-lHBe-fV6Q-SgTU3s in vg: 4bc98a03-c7c7-4701-b055-94f777670509
storageRefresh::ERROR::2015-04-19 19:01:50,009::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
storageRefresh::ERROR::2015-04-19 19:01:50,009::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-13::ERROR::2015-04-19 19:01:50,017::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 3eae5544-0c43-483b-bf06-118660fcbfb1
Thread-13::ERROR::2015-04-19 19:01:50,018::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 3eae5544-0c43-483b-bf06-118660fcbfb1
Thread-13::ERROR::2015-04-19 19:01:50,908::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 3c9387ca-e89f-48c5-9dba-824639e604af
Thread-13::ERROR::2015-04-19 19:01:50,908::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 3c9387ca-e89f-48c5-9dba-824639e604af
Thread-13::ERROR::2015-04-19 19:01:51,747::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70
Thread-13::ERROR::2015-04-19 19:01:51,747::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70
Thread-13::ERROR::2015-04-19 19:01:52,487::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 6210d179-81a4-4fd0-bb09-350238520812
Thread-13::ERROR::2015-04-19 19:01:52,487::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 6210d179-81a4-4fd0-bb09-350238520812
storageRefresh::ERROR::2015-04-19 19:01:55,178::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
storageRefresh::ERROR::2015-04-19 19:01:55,178::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-24::ERROR::2015-04-19 19:04:33,400::task::866::Storage.TaskManager.Task::(_setError) Task=`7c53834e-55ec-4ff6-ab1f-75525e274f30`::Unexpected error
Thread-24::ERROR::2015-04-19 19:04:33,405::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': "Unknown pool id, pool not connected: (u'6d96f52d-d791-4f66-83bd-2553ca0f3012',)", 'code': 309}}
Thread-25::ERROR::2015-04-19 19:04:33,416::task::866::Storage.TaskManager.Task::(_setError) Task=`da0969e6-651b-4c72-b6b5-2023194ecaaa`::Unexpected error
Thread-25::ERROR::2015-04-19 19:04:33,419::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': "Unknown pool id, pool not connected: (u'6d96f52d-d791-4f66-83bd-2553ca0f3012',)", 'code': 309}}
Thread-26::ERROR::2015-04-19 19:04:33,553::task::866::Storage.TaskManager.Task::(_setError) Task=`7e8fb4f9-78da-4e4a-95d8-8e421272f29b`::Unexpected error
Thread-26::ERROR::2015-04-19 19:04:33,555::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': 'Not SPM: ()', 'code': 654}}
Thread-81::ERROR::2015-04-19 19:04:43,466::task::866::Storage.TaskManager.Task::(_setError) Task=`0f553806-cdef-4414-b0f1-fc249955bec5`::Unexpected error
Thread-81::ERROR::2015-04-19 19:04:43,468::dispatcher::79::Storage.Dispatcher::(wrapper) Secured object is not in safe state
Thread-108::ERROR::2015-04-19 19:04:50,712::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
Thread-108::ERROR::2015-04-19 19:04:50,713::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-107::ERROR::2015-04-19 19:04:50,716::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
Thread-107::ERROR::2015-04-19 19:04:50,716::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-109::ERROR::2015-04-19 19:04:50,721::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
Thread-109::ERROR::2015-04-19 19:04:50,721::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-110::ERROR::2015-04-19 19:04:50,776::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
Thread-110::ERROR::2015-04-19 19:04:50,776::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-111::ERROR::2015-04-19 19:04:50,871::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
Thread-111::ERROR::2015-04-19 19:04:50,872::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-112::ERROR::2015-04-19 19:04:50,984::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
Thread-112::ERROR::2015-04-19 19:04:50,984::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
25c3fa2c-8b8d-4a02-9536-7411ea52a192::ERROR::2015-04-19 19:15:18,442::task::866::Storage.TaskManager.Task::(_setError) Task=`25c3fa2c-8b8d-4a02-9536-7411ea52a192`::Unexpected error
3783d493-afe6-468c-b4df-185283ee48ab::ERROR::2015-04-19 19:15:18,497::task::866::Storage.TaskManager.Task::(_setError) Task=`3783d493-afe6-468c-b4df-185283ee48ab`::Unexpected error
Libvirt rpms on host
--------------------------------
libvirt-client-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.2.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.2.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.2.x86_64
Created attachment 1016104 [details]
engine, server, vdsm and libvirt logs

Added logs.

NOTE: THERE IS A 3 HOUR DIFFERENCE BETWEEN THE ENGINE AND HOST, e.g.:
Current Engine time: Sun Apr 19 18:18:05 IDT 2015
Current Host time: Sun Apr 19 21:18:35 IDT 2015
The error seems unrelated to the restart (missing PVs?), so I'm not quite sure what state I'd want the disks to be in... Adam/Greg, please take a look.

Kevin - assuming the domain is still functioning (which doesn't necessarily follow from the error), if you manually update the DB to mark those disks as OK [1], can you run the VM?

[1] In psql, use:
UPDATE images SET imagestatus = 0 WHERE image_group_id = <the disk's ID>;
COMMIT;
Just to verify, the illegal state referred to here is illegal on the host? (I don't see any references to illegal disks in the supplied engine log.)
(In reply to Greg Padgett from comment #4)
> Just to verify, the illegal state referred to here is illegal on the host?
> (I don't see any references to illegal disks in the supplied engine log.)

Then again, the vdsm log has no errors at all.
Kevin, are you sure these are the right logs?
(In reply to Greg Padgett from comment #4)
> Just to verify, the illegal state referred to here is illegal on the host?
> (I don't see any references to illegal disks in the supplied engine log.)

1. NOTE: THERE IS A 3 HOUR DIFFERENCE BETWEEN THE ENGINE AND HOST, e.g.:
Current Engine time: Sun Apr 19 18:18:05 IDT 2015
Current Host time: Sun Apr 19 21:18:35 IDT 2015

2. The ILLEGAL status appears in the host vdsm log, e.g.:
cbfe5fbe-9f6a-441f-bb5b-4bac3de41fdc::INFO::2015-04-19 19:01:20,332::volume::617::Storage.Volume::(setLegality) sdUUID=17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70 imgUUID=4a31b260-8136-411d-9743-b3e937859097 volUUID = 54c94300-650a-4da9-9656-87969a6c60a4 legality = ILLEGAL
0bb73a86-25be-42d5-9695-3b6bf1411d5b::INFO::2015-04-19 19:01:24,958::volume::617::Storage.Volume::(setLegality) sdUUID=3eae5544-0c43-483b-bf06-118660fcbfb1 imgUUID=32d03b22-d9a8-40e8-a7f6-fae12cdbb840 volUUID = 2655ffd3-5c64-4cd4-8aac-16599acae719 legality = ILLEGAL
e4ece9d5-2b01-4fda-b1af-2a6e0a766bed::INFO::2015-04-19

3. In the GUI we see the ILLEGAL status displayed under Storage > <Storage Domain> > Disk Snapshots > Status > Illegal. Added a screenshot.

4. 2 of the disk snapshots are displayed as Illegal since the failed Live Delete Snapshot operation. Subsequently, I can no longer delete this snapshot.
(In reply to Allon Mureinik from comment #5)
> (In reply to Greg Padgett from comment #4)
> > Just to verify, the illegal state referred to here is illegal on the host?
> > (I don't see any references to illegal disks in the supplied engine log.)
> Then again, the vdsm log has no errors at all.
> Kevin, are you sure these are the right logs?

Apologies, the vdsm log rotated before I uploaded them. Adding logs again.
Created attachment 1016393 [details]
engine, server, vdsm and libvirt logs

Added logs again, as the vdsm log was old due to log rotation.
Edited the summary of the bz to reflect the problem
The included vdsm log does not show the actual merge operation, so it's hard to tell whether there were any specific errors in that flow. What I see here are some errors from LVM indicating missing PVs. It seems like there could have been some external corruption of the storage domain (perhaps someone removed some LUNs on the server side?).
(In reply to Kevin Alon Goldblatt from comment #6)
> 4. 2 of the disk snapshots are displayed as Illegal since the failed Live
> Delete Snapshot operation. Subsequently I can no longer delete this snapshot

Just to clarify - it's just the snapshot that's marked illegal?
The Disk object is OK?
Can you stop the VM and re-run it?
(In reply to Allon Mureinik from comment #11)
> (In reply to Kevin Alon Goldblatt from comment #6)
> > 4. 2 of the disk snapshots are displayed as Illegal since the failed Live
> > Delete Snapshot operation. Subsequently I can no longer delete this snapshot
>
> Just to clarify - it's just the snapshot that's marked illegal?
> The Disk object is OK?
> Can you stop the VM and re-run it?

Yes, I can stop and re-run the VM, but any attempt to delete the same snapshot again fails.
(In reply to Adam Litke from comment #10)
> The included vdsm log does not show the actual merge operation so it's hard
> to tell if there were any specific errors with that flow. What I see here
> are some errors from LVM indicating missing PVs. It seems like there could
> have been some external corruption of the storage domain (perhaps someone
> removed some LUNs on the server side?)

No LUNs are missing on the storage.
Created attachment 1016713 [details]
engine, server, vdsm and libvirt logs

Added an additional vdsm log that includes the earlier time from before the merge: vdsm.log.22.xz
Upon further investigation I found that the illegal disk is displayed in Storage > Disks > Disk Snapshots but not in VM > Snapshots > Disks. This means that the database has 2 tables that are not synchronized.
(In reply to Kevin Alon Goldblatt from comment #15)
> Upon further investigation I found that the illegal disk is displayed in the
> Storage > Disks > Disk snapshots but not in the VM > Snapshots > Disks. This
> means that the database has 2 tables that are not synchronized

No, it doesn't. It means the disk as a whole is OK (i.e., it can be used for VM operations like running it), but there's a problem somewhere down the chain which prevents things like merging and requires some intervention.
(In reply to Kevin Alon Goldblatt from comment #15)
> Upon further investigation I found that the illegal disk is displayed in the
> Storage > Disks > Disk snapshots but not in the VM > Snapshots > Disks.
<snip>

Okay, I think I see what's going on. The reboot happened during the destroy_image phase of the merge attempts, and thus the merges were successful but the snapshot volumes are still on the storage. We see they're illegal, as they should be; however, the flow is designed so that you can retry the merge on the illegal volumes in order to clean them up. This last bit isn't working for some reason or another. I'm looking into it.
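[Editorial note] A minimal sketch of the retry design described above, assuming the stated behavior (a volume left ILLEGAL by an earlier attempt has already been merged into its parent and only needs cleanup). This is illustrative only; the class and method names below are hypothetical and are not the actual oVirt engine/VDSM code:

// Hypothetical sketch of an idempotent live-merge retry step.
enum VolumeLegality { LEGAL, ILLEGAL }

class Volume {
    private final VolumeLegality legality;
    Volume(VolumeLegality legality) { this.legality = legality; }
    VolumeLegality getLegality() { return legality; }
}

class LiveMergeRetrySketch {

    boolean retryMergeStep(Volume topVolume) {
        if (topVolume.getLegality() == VolumeLegality.ILLEGAL) {
            // Data is assumed to already be in the parent volume;
            // skip the merge and just remove the leftover volume.
            return destroyVolume(topVolume);
        }
        // Otherwise perform the full merge first, then remove the volume.
        if (!mergeIntoParent(topVolume)) {
            return false;
        }
        return destroyVolume(topVolume);
    }

    // Placeholders standing in for the real engine/VDSM operations.
    boolean mergeIntoParent(Volume v) { /* ... */ return true; }
    boolean destroyVolume(Volume v)   { /* ... */ return true; }
}

Under that design, re-running the delete on the same snapshot should simply clean up the orphaned ILLEGAL volumes, which is the part reported as not working here.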
After further testing, it appears the live merge could not be retried because the command would not converge after a failure to delete the image. [1] With the linked patch, the failure causes overall command failure, leaving the snapshot and image unlocked and in the proper state to allow a retry.

There may still be a failure due to the timing of the restart, but note that live merge isn't designed to necessarily succeed if restarted at any point--but rather to recover gracefully if re-executed after a failure such as this. Please test with this patch once it's in the build and let me know if you still see any problems.

[1] In this case, DestroyImageCommand failed, which is the third part (of 3) of the live merge flow on the engine. This child command is an old-style command (it runs an SPM task) whereas its parent, which controls the flow, uses newer infrastructure (CommandCoordinator). The two don't always play well together, and we had already added a special case to get the endAction flow working between the two. This was extended further to avoid an endless loop of status-checking if the child command fails.
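[Editorial note] A rough sketch of the convergence problem and fix described above. It is illustrative only; the names below are hypothetical and do not reflect the real RemoveSnapshotSingleDiskLiveCommand or CommandCoordinator implementation:

// Hypothetical sketch: a parent live-merge command polling its DESTROY_IMAGE child.
// Before the fix, a failed child could leave the parent re-checking status forever;
// with the fix, the parent ends with failure and releases its locks so the
// snapshot and images stay in a retryable state.
enum ChildStatus { ACTIVE, SUCCEEDED, FAILED }

class LiveMergeParentSketch {

    void onChildStatusPolled(ChildStatus status) {
        switch (status) {
            case SUCCEEDED:
                proceedToNextStep();
                break;
            case FAILED:
                // Converge instead of re-polling indefinitely.
                endWithFailure();
                releaseSnapshotAndImageLocks();
                break;
            case ACTIVE:
                // Still running; poll again on the next engine callback.
                break;
        }
    }

    // Placeholders for the real engine operations.
    void proceedToNextStep()            { /* ... */ }
    void endWithFailure()               { /* ... */ }
    void releaseSnapshotAndImageLocks() { /* ... */ }
}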
Found some issues later in the retry flow, moving back to assigned.
A partial fix is posted to gerrit. Looks like there's still a case where the engine can "forget" that it performed a merge, which will cause an error when it retries during command recovery.
(In reply to Greg Padgett from comment #20)
> A partial fix is posted to gerrit. Looks like there's still a case where
> the engine can "forget" that it performed a merge, which will cause an error
> when it retries during command recovery.

So let's clone the bug for better tracking.
We'll have this bug for 3.5.4 to track the fix we DO have, and use the other one to track the remaining problem.
(In reply to Allon Mureinik from comment #21)
> (In reply to Greg Padgett from comment #20)
> > A partial fix is posted to gerrit. Looks like there's still a case where
> > the engine can "forget" that it performed a merge, which will cause an error
> > when it retries during command recovery.
> So let's clone the bug for better tracking.
> We'll have this bug for 3.5.4 to track the fix we DO have, and use the other
> one to track the remaining problem.

Opened bug 1227497 to track that issue.
Verified using Version (rpm -qa):
--------------------------------------------------------
vdsm-4.17.0-1054.git562e711.el7.noarch
ovirt-engine-3.6.0-0.0.master.20150627185750.git6f063c1.el6.noarch

Verified with the following scenario:
-------------------------------------------------------
Steps to Reproduce:
1. Create a VM with several disks, including block preallocated, block thin, NFS preallocated and NFS thin.
2. Start the VM.
3. Create 3 snapshots: snsa1, snsa2, snsa3.
4. Delete snapshot snsa2 - while the snapshot is locked, restart vdsm.
>>>>> The snapshot delete operation was SUCCESSFUL
5. Delete snapshot snsa1 - while the snapshot is locked, restart vdsm.
>>>>> The snapshot delete operation was SUCCESSFUL

Verified using a nested host!

Moving to VERIFIED!
Hi,

Is it possible to port this fix back into RHEV 3.5, please? This is the version we've integrated against, and we would really like to see a fix for that version.

-Amit Mitkar
(In reply to dev-unix-virtualization from comment #25)
> Is it possible to port this fix back into RHEV 3.5, please? This is the
> version we've integrated against, and we would really like to see a fix for
> that version.

It was included in 3.5.4.
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE