Bug 1213157

Summary: VM block SNAPSHOT disks become illegal after failed Live Delete Snapshot Merge
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.5.1
Hardware: x86_64
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
Keywords: ZStream
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Fixed In Version: ovirt-engine-3.6.0_alpha1
Doc Type: Bug Fix
Type: Bug
Reporter: Kevin Alon Goldblatt <kgoldbla>
Assignee: Greg Padgett <gpadgett>
QA Contact: Kevin Alon Goldblatt <kgoldbla>
CC: ahino, alitke, amureini, dev-unix-virtualization, gklein, gpadgett, kgoldbla, lpeer, lsurette, ratamir, rbalakri, Rhev-m-bugs, tnisan, v.astafiev, yeylon, ykaul, ylavi
oVirt Team: Storage
Clones: 1227497 1231535 (view as bug list)
Bug Blocks: 1231535
Attachments:
engine, server, vdsm and libvirt logs (attachment 1016104; no flags)
engine, server, vdsm and libvirt logs (attachment 1016393; no flags)
engine, server, vdsm and libvirt logs (attachment 1016713; no flags)

Description Kevin Alon Goldblatt 2015-04-19 15:00:57 UTC
Description of problem:
VM disks become illegal after failed Live Merge

Version-Release number of selected component (if applicable):
ver 3.5.1 vt14.3
rhevm-3.5.1-0.4.el6ev.noarch
vdsm-4.16.13.1-1.el7ev.x86_64

How reproducible:


Steps to Reproduce:
1. Create a VM with several disks, including block preallocated, block thin, NFS preallocated, and NFS thin
2. Start the VM
3. Create 3 snapshots: snsa1, snsa2, snsa3
4. Delete snapshot snsa2 and, while the snapshot is locked, restart vdsm >>>>> The snapshot delete operation fails, BUT 2 of the block disks (vm11_Disk3 - preallocated and vm11_Disk3 - Thin) are now illegal. All further attempts to delete the snapshot fail

Actual results:
2 of the block disks (vm11_Disk3 - preallocated and vm11_Disk3 - Thin) are now illegal after the failed Live Snapshot Delete merge operation

Expected results:
All disks should remain in a legal state after a failed Live Snapshot Delete Merge operation

Additional info:

THERE IS A 3 HOUR TIME DIFFERENCE BETWEEN THE ENGINE AND THE HOST

ENGINE log
-----------------------
2015-04-19 16:00:13,498 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-8) Correlation ID: 4684f0de, Job ID: 2c2cac3f-6ce6-4c54-b86c-24d910e519bb, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'snsa2' deletion for VM 'vm_11' was initiated by admin@internal.

2015-04-19 16:01:21,236 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) Unable to process messages
2015-04-19 16:04:13,488 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-69) [1d8798ed] Command SpmStatusVDSCommand(HostName = nott-vds1, HostId = eaba7988-e524-4c19-bd7b-8f9ba0bbc769, storagePoolId = 6d96f52d-d791-4f66-83bd-2553ca0f3012) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2015-04-19 16:04:13,625 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand] (DefaultQuartzScheduler_Worker-69) [6497d7ed] Failed in GetStoragePoolInfoVDS method
2015-04-19 16:04:13,626 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-69) [6497d7ed] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to GetStoragePoolInfoVDS, error = Unknown pool id, pool not connected: (u'6d96f52d-d791-4f66-83bd-2553ca0f3012',), code = 309
2015-04-19 16:04:13,639 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (pool-6-thread-8) [4033c4ff] Failed in DestroyImageVDS method
2015-04-19 16:04:13,639 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-8) [4033c4ff] IrsBroker::Failed::DestroyImageVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to DestroyImageVDS, error = Unknown pool id, pool not connected: (u'6d96f52d-d791-4f66-83bd-2553ca0f3012',), code = 309
2015-04-19 16:04:13,648 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (pool-6-thread-8) [4033c4ff] Command org.ovirt.engine.core.bll.DestroyImageCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSErrorException: IRSGenericException: IRSErrorException: Failed to DestroyImageVDS, error = Unknown pool id, pool not connected: (u'6d96f52d-d791-4f66-83bd-2553ca0f3012',), code = 309 (Failed with error StoragePoolUnknown and code 309)
2015-04-19 16:04:13,672 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (pool-6-thread-8) [4033c4ff] Transaction rolled-back for command: org.ovirt.engine.core.bll.DestroyImageCommand.
2015-04-19 16:04:13,776 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-85) [5656cb99] Command HSMGetAllTasksStatusesVDSCommand(HostName = nott-vds1, HostId = eaba7988-e524-4c19-bd7b-8f9ba0bbc769) execution failed. Exception: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM: ()
2015-04-19 16:04:13,987 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-96) [6925c3dc] Command ListVDSCommand(HostName = nott-vds1, HostId = eaba7988-e524-4c19-bd7b-8f9ba0bbc769, vds=Host[nott-vds1,eaba7988-e524-4c19-bd7b-8f9ba0bbc769]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2015-04-19 16:04:14,003 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-96) [6925c3dc] Failure to refresh Vds runtime info: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2015-04-19 16:04:21,016 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-89) [7d2f99b] Unknown child command status for step DESTROY_IMAGE
2015-04-19 16:04:23,690 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-16) [7b969bf1] Command HSMGetAllTasksStatusesVDSCommand(HostName = nott-vds1, HostId = eaba7988-e524-4c19-bd7b-8f9ba0bbc769) execution failed. Exception: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM
2015-04-19 16:04:28,536 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-35) [18ed9a86] BaseAsyncTask::logEndTaskFailure: Task 87012cd5-4bdf-430f-a514-9d1ef3550adb (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:28,544 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-8) [723a469e] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:28,546 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-35) [18ed9a86] BaseAsyncTask::logEndTaskFailure: Task 7da1ccb7-3440-4ebc-ab9e-fe4712e1493f (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:28,551 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-23) [624cd6d5] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:28,574 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-8) [2281f762] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:28,584 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-23) [4a68a89f] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:31,050 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [723a469e] Failed child command status for step DESTROY_IMAGE
2015-04-19 16:04:31,854 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [7d2f99b] Snapshot 314e7dbc-9784-43b4-907c-7fda03e3463a images dc6ac4be-a71d-4822-bd79-8522abdee74b..16565bda-e777-4362-9bfb-81fea9ad0ca3 merged, but volume removal failed. Some or all of the following volumes may be orphaned: [dc6ac4be-a71d-4822-bd79-8522abdee74b]. Please retry Live Merge on the snapshot to complete the operation.
2015-04-19 16:04:31,906 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [624cd6d5] Failed child command status for step DESTROY_IMAGE
2015-04-19 16:04:38,635 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-21) [6740fd3e] BaseAsyncTask::logEndTaskFailure: Task 87012cd5-4bdf-430f-a514-9d1ef3550adb (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:38,636 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-21) [6740fd3e] BaseAsyncTask::logEndTaskFailure: Task 7da1ccb7-3440-4ebc-ab9e-fe4712e1493f (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:38,651 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-5) [723a469e] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:38,654 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-39) [624cd6d5] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:38,669 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-5) [2281f762] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:38,670 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-39) [4a68a89f] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:42,851 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-75) [723a469e] Snapshot 314e7dbc-9784-43b4-907c-7fda03e3463a images 69c6b3b2-38f7-4b2f-9170-b9f1f46a796d..09e9930a-41b9-4eff-a31c-5b7b590bf63b merged, but volume removal failed. Some or all of the following volumes may be orphaned: [69c6b3b2-38f7-4b2f-9170-b9f1f46a796d]. Please retry Live Merge on the snapshot to complete the operation.
2015-04-19 16:04:42,862 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (DefaultQuartzScheduler_Worker-75) [2281f762] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand

2015-04-19 16:04:44,288 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-75) [4684f0de] All Live Merge child commands have completed, status FAILED


2015-04-19 16:04:46,002 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-75) [624cd6d5] Snapshot 314e7dbc-9784-43b4-907c-7fda03e3463a images 9f9f9f3c-37ac-4af6-9090-b8121f7028b0..81a8716d-713c-42f8-aa9c-7bd1692cd4ed merged, but volume removal failed. Some or all of the following volumes may be orphaned: [9f9f9f3c-37ac-4af6-9090-b8121f7028b0]. Please retry Live Merge on the snapshot to complete the operation.
2015-04-19 16:04:46,006 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (DefaultQuartzScheduler_Worker-75) [4a68a89f] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:48,864 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-43) BaseAsyncTask::logEndTaskFailure: Task 87012cd5-4bdf-430f-a514-9d1ef3550adb (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:48,864 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-43) BaseAsyncTask::logEndTaskFailure: Task 7da1ccb7-3440-4ebc-ab9e-fe4712e1493f (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:48,876 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-25) [624cd6d5] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:48,887 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-33) [723a469e] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:48,892 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-25) [4a68a89f] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:48,894 ERROR [org.ovirt.engine.core.bll.DestroyImageCommand] (org.ovirt.thread.pool-7-thread-33) [2281f762] Ending command with failure: org.ovirt.engine.core.bll.DestroyImageCommand
2015-04-19 16:04:56,019 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-57) [4684f0de] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotCommand
2015-04-19 16:04:56,039 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-57) [4684f0de] Correlation ID: 4684f0de, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'snsa2' for VM 'vm_11'.
2015-04-19 16:04:59,313 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-17) [56cbba47] BaseAsyncTask::logEndTaskFailure: Task 87012cd5-4bdf-430f-a514-9d1ef3550adb (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:59,313 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-17) [56cbba47] BaseAsyncTask::logEndTaskFailure: Task 7da1ccb7-3440-4ebc-ab9e-fe4712e1493f (Parent Command RemoveSnapshotSingleDiskLive, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
2015-04-19 16:04:59,320 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-22) [624cd6d5] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand
2015-04-19 16:04:59,339 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (org.ovirt.thread.pool-7-thread-12) [723a469e] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand




VDSM log
-----------------------------
cbfe5fbe-9f6a-441f-bb5b-4bac3de41fdc::INFO::2015-04-19 19:01:20,332::volume::617::Storage.Volume::(setLegality) sdUUID=17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70 imgUUID=4a31b260-8136-411d-9743-b3e937859097 volUUID = 54c94300-650a-4da9-9656-87969a6c60a4 legality = ILLEGAL 
0bb73a86-25be-42d5-9695-3b6bf1411d5b::INFO::2015-04-19 19:01:24,958::volume::617::Storage.Volume::(setLegality) sdUUID=3eae5544-0c43-483b-bf06-118660fcbfb1 imgUUID=32d03b22-d9a8-40e8-a7f6-fae12cdbb840 volUUID = 2655ffd3-5c64-4cd4-8aac-16599acae719 legality = ILLEGAL 
e4ece9d5-2b01-4fda-b1af-2a6e0a766bed::INFO::2015-04-19 19:01:30,522::volume::617::Storage.Volume::(setLegality) sdUUID=17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70 imgUUID=569d7156-a956-4e56-afe2-a744bcf07260 volUUID = 43c81e12-7987-4d5d-88a7-8e5dae69fb3c legality = ILLEGAL 
87012cd5-4bdf-430f-a514-9d1ef3550adb::INFO::2015-04-19 19:01:32,926::volume::617::Storage.Volume::(setLegality) sdUUID=3eae5544-0c43-483b-bf06-118660fcbfb1 imgUUID=de1e8749-c8df-45ba-b0a9-c40331f8d8ba volUUID = 69c6b3b2-38f7-4b2f-9170-b9f1f46a796d legality = ILLEGAL 
7da1ccb7-3440-4ebc-ab9e-fe4712e1493f::INFO::2015-04-19 19:01:33,175::volume::617::Storage.Volume::(setLegality) sdUUID=3c9387ca-e89f-48c5-9dba-824639e604af imgUUID=36a57edb-ea1c-40d4-be76-8e4018513042 volUUID = 9f9f9f3c-37ac-4af6-9090-b8121f7028b0 legality = ILLEGAL 
0e13510c-2932-4ad5-b2d8-f0c0daa379b2::INFO::2015-04-19 19:15:16,359::volume::617::Storage.Volume::(setLegality) sdUUID=6210d179-81a4-4fd0-bb09-350238520812 imgUUID=12ec201c-f95d-4909-a9a3-b75688d219c3 volUUID = dc6ac4be-a71d-4822-bd79-8522abdee74b legality = ILLEGAL 

ioprocess communication (17553)::ERROR::2015-04-19 19:01:30,993::__init__::152::IOProcessClient::(_communicate) IOProcess failure
ioprocess communication (23042)::ERROR::2015-04-19 19:01:30,995::__init__::152::IOProcessClient::(_communicate) IOProcess failure
ioprocess communication (22967)::ERROR::2015-04-19 19:01:30,994::__init__::152::IOProcessClient::(_communicate) IOProcess failure
ioprocess communication (20822)::ERROR::2015-04-19 19:01:30,993::__init__::152::IOProcessClient::(_communicate) IOProcess failure
ioprocess communication (21250)::ERROR::2015-04-19 19:01:30,995::__init__::152::IOProcessClient::(_communicate) IOProcess failure
Thread-44::ERROR::2015-04-19 19:01:36,904::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 27d6175c-6a75-4eb6-910f-e9a158e63121 monitoring information
storageRefresh::ERROR::2015-04-19 19:01:47,635::lvm::336::Storage.LVM::(_reloadpvs) Missing pv: unRCWJ-idrB-HHdD-9iXo-stpp-Yqyt-0Xrpwv in vg: 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40
storageRefresh::ERROR::2015-04-19 19:01:47,635::lvm::336::Storage.LVM::(_reloadpvs) Missing pv: 3cEMHe-TWUD-iZJ2-pCYo-rEn5-EgSa-DMeSRZ in vg: 4bc98a03-c7c7-4701-b055-94f777670509
storageRefresh::ERROR::2015-04-19 19:01:47,635::lvm::336::Storage.LVM::(_reloadpvs) Missing pv: cP95qE-mfrM-qHgQ-OAH8-lHBe-fV6Q-SgTU3s in vg: 4bc98a03-c7c7-4701-b055-94f777670509
storageRefresh::ERROR::2015-04-19 19:01:50,009::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
storageRefresh::ERROR::2015-04-19 19:01:50,009::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-13::ERROR::2015-04-19 19:01:50,017::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 3eae5544-0c43-483b-bf06-118660fcbfb1
Thread-13::ERROR::2015-04-19 19:01:50,018::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 3eae5544-0c43-483b-bf06-118660fcbfb1
Thread-13::ERROR::2015-04-19 19:01:50,908::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 3c9387ca-e89f-48c5-9dba-824639e604af
Thread-13::ERROR::2015-04-19 19:01:50,908::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 3c9387ca-e89f-48c5-9dba-824639e604af
Thread-13::ERROR::2015-04-19 19:01:51,747::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70
Thread-13::ERROR::2015-04-19 19:01:51,747::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70
Thread-13::ERROR::2015-04-19 19:01:52,487::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 6210d179-81a4-4fd0-bb09-350238520812
Thread-13::ERROR::2015-04-19 19:01:52,487::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 6210d179-81a4-4fd0-bb09-350238520812
storageRefresh::ERROR::2015-04-19 19:01:55,178::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
storageRefresh::ERROR::2015-04-19 19:01:55,178::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-24::ERROR::2015-04-19 19:04:33,400::task::866::Storage.TaskManager.Task::(_setError) Task=`7c53834e-55ec-4ff6-ab1f-75525e274f30`::Unexpected error
Thread-24::ERROR::2015-04-19 19:04:33,405::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': "Unknown pool id, pool not connected: (u'6d96f52d-d791-4f66-83bd-2553ca0f3012',)", 'code': 309}}
Thread-25::ERROR::2015-04-19 19:04:33,416::task::866::Storage.TaskManager.Task::(_setError) Task=`da0969e6-651b-4c72-b6b5-2023194ecaaa`::Unexpected error
Thread-25::ERROR::2015-04-19 19:04:33,419::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': "Unknown pool id, pool not connected: (u'6d96f52d-d791-4f66-83bd-2553ca0f3012',)", 'code': 309}}
Thread-26::ERROR::2015-04-19 19:04:33,553::task::866::Storage.TaskManager.Task::(_setError) Task=`7e8fb4f9-78da-4e4a-95d8-8e421272f29b`::Unexpected error
Thread-26::ERROR::2015-04-19 19:04:33,555::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': 'Not SPM: ()', 'code': 654}}
Thread-81::ERROR::2015-04-19 19:04:43,466::task::866::Storage.TaskManager.Task::(_setError) Task=`0f553806-cdef-4414-b0f1-fc249955bec5`::Unexpected error
Thread-81::ERROR::2015-04-19 19:04:43,468::dispatcher::79::Storage.Dispatcher::(wrapper) Secured object is not in safe state
Thread-108::ERROR::2015-04-19 19:04:50,712::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
Thread-108::ERROR::2015-04-19 19:04:50,713::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-107::ERROR::2015-04-19 19:04:50,716::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
Thread-107::ERROR::2015-04-19 19:04:50,716::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-109::ERROR::2015-04-19 19:04:50,721::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
Thread-109::ERROR::2015-04-19 19:04:50,721::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-110::ERROR::2015-04-19 19:04:50,776::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
Thread-110::ERROR::2015-04-19 19:04:50,776::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-111::ERROR::2015-04-19 19:04:50,871::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
Thread-111::ERROR::2015-04-19 19:04:50,872::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
Thread-112::ERROR::2015-04-19 19:04:50,984::lvm::404::Storage.LVM::(_reloadvgs) vg 4bc98a03-c7c7-4701-b055-94f777670509 has pv_count 4 but pv_names ('/dev/mapper/360060160f4a030005aa83a07a83de411', '/dev/mapper/360060160f4a03000f2a7a56e563ee411')
Thread-112::ERROR::2015-04-19 19:04:50,984::lvm::404::Storage.LVM::(_reloadvgs) vg 40138e3b-14b1-44b1-b5e9-6bd6ff17ca40 has pv_count 2 but pv_names ('/dev/mapper/36006016012823000e014928626bbe311',)
25c3fa2c-8b8d-4a02-9536-7411ea52a192::ERROR::2015-04-19 19:15:18,442::task::866::Storage.TaskManager.Task::(_setError) Task=`25c3fa2c-8b8d-4a02-9536-7411ea52a192`::Unexpected error
3783d493-afe6-468c-b4df-185283ee48ab::ERROR::2015-04-19 19:15:18,497::task::866::Storage.TaskManager.Task::(_setError) Task=`3783d493-afe6-468c-b4df-185283ee48ab`::Unexpected error

Comment 1 Kevin Alon Goldblatt 2015-04-19 15:12:02 UTC
Libvirt rpms on host
--------------------------------
libvirt-client-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.2.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.2.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.2.x86_64

Comment 2 Kevin Alon Goldblatt 2015-04-19 15:19:10 UTC
Created attachment 1016104 [details]
engine, server, vdsm and libvirt logs

Added logs

NOTE THERE IS A 3 HOUR DIFFERENCE BETWEEN THE ENGINE AND HOST e.g.

Current Engine time: Sun Apr 19 18:18:05 IDT 2015
Current Host time:   Sun Apr 19 21:18:35 IDT 2015

Comment 3 Allon Mureinik 2015-04-19 16:40:32 UTC
The error seems unrelated to the restart (missing PVs?), so I'm not quite sure what state I'd want the disks to be in...

Adam/Greg, please take a look.

Kevin - assuming the domain is still functioning (which doesn't necessarily seem to be the case, judging from the error), if you manually update the DB to mark those disks as OK [1], can you run the VM?


[1] In psql, use:

UPDATE images
SET    imagestatus = 0
WHERE  image_group_id = <the disk's ID>;

COMMIT;
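
For reference, the same check and update can also be scripted instead of typed into a psql session. This is only a convenience sketch, not part of any fix: it assumes psycopg2 is available and that the engine database is reachable as "engine" on localhost with suitable credentials; the disk ID below is a placeholder, not one of the IDs from this bug.

# Sketch only: inspect and reset imagestatus for one disk, mirroring the UPDATE above.
import psycopg2

DISK_GROUP_ID = "00000000-0000-0000-0000-000000000000"  # placeholder for <the disk's ID>

conn = psycopg2.connect(dbname="engine", user="engine", password="...", host="localhost")
try:
    with conn, conn.cursor() as cur:
        # Look at the current status first (0 = OK, per the UPDATE above).
        cur.execute("SELECT imagestatus FROM images WHERE image_group_id = %s", (DISK_GROUP_ID,))
        print(cur.fetchall())
        # Same statement as above; leaving the 'with conn' block commits the transaction.
        cur.execute("UPDATE images SET imagestatus = 0 WHERE image_group_id = %s", (DISK_GROUP_ID,))
finally:
    conn.close()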

Comment 4 Greg Padgett 2015-04-19 17:45:13 UTC
Just to verify, the illegal state referred to here is illegal on the host?  (I don't see any references to illegal disks in the supplied engine log.)

Comment 5 Allon Mureinik 2015-04-20 11:01:00 UTC
(In reply to Greg Padgett from comment #4)
> Just to verify, the illegal state referred to here is illegal on the host? 
> (I don't see any references to illegal disks in the supplied engine log.)
Then again, the vdsm log has no errors at all.
Kevin, are you sure these are the right logs?

Comment 6 Kevin Alon Goldblatt 2015-04-20 14:08:48 UTC
(In reply to Greg Padgett from comment #4)
> Just to verify, the illegal state referred to here is illegal on the host? 
> (I don't see any references to illegal disks in the supplied engine log.)


1. NOTE THERE IS A 3 HOUR DIFFERENCE BETWEEN THE ENGINE AND HOST e.g.

Current Engine time: Sun Apr 19 18:18:05 IDT 2015
Current Host time:   Sun Apr 19 21:18:35 IDT 2015


2. ILLEGAL status appears in the host vdsm log e.g.

cbfe5fbe-9f6a-441f-bb5b-4bac3de41fdc::INFO::2015-04-19 19:01:20,332::volume::617::Storage.Volume::(setLegality) sdUUID=17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70 imgUUID=4a31b260-8136-411d-9743-b3e937859097 volUUID = 54c94300-650a-4da9-9656-87969a6c60a4 legality = ILLEGAL 
0bb73a86-25be-42d5-9695-3b6bf1411d5b::INFO::2015-04-19 19:01:24,958::volume::617::Storage.Volume::(setLegality) sdUUID=3eae5544-0c43-483b-bf06-118660fcbfb1 imgUUID=32d03b22-d9a8-40e8-a7f6-fae12cdbb840 volUUID = 2655ffd3-5c64-4cd4-8aac-16599acae719 legality = ILLEGAL 
e4ece9d5-2b01-4fda-b1af-2a6e0a766bed::INFO::2015-04-19 


3. In the GUI, the ILLEGAL status is displayed under Storage > <Storage Domain> > Disk Snapshots > Status.
Added a screenshot.

4. Two of the disk snapshots have been displayed as Illegal since the failed Live Delete Snapshot operation. Consequently, I can no longer delete this snapshot.

Comment 7 Kevin Alon Goldblatt 2015-04-20 14:10:54 UTC
(In reply to Allon Mureinik from comment #5)
> (In reply to Greg Padgett from comment #4)
> > Just to verify, the illegal state referred to here is illegal on the host? 
> > (I don't see any references to illegal disks in the supplied engine log.)
> Then again, the vdsm log has no errors at all.
> Kevin, are you sure these are the right logs?

Apologies, the vdsm logs rotated before I uploaded them. Adding the logs again.

Comment 8 Kevin Alon Goldblatt 2015-04-20 14:12:59 UTC
Created attachment 1016393 [details]
engine, server, vdsm and libvirt logs

Added logs again as the vdsm log was old due to log rotation

Comment 9 Kevin Alon Goldblatt 2015-04-20 14:48:25 UTC
Edited the summary of the bz to reflect the problem

Comment 10 Adam Litke 2015-04-20 16:06:00 UTC
The included vdsm log does not show the actual merge operation, so it's hard to tell if there were any specific errors in that flow.  What I see here are some errors from LVM indicating missing PVs.  It seems like there could have been some external corruption of the storage domain (perhaps someone removed some LUNs on the server side?).
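
As an aside, the PV/VG consistency vdsm complains about can be checked directly on the host. A rough sketch follows (it uses standard lvm2 report fields via pvs/vgs; everything else is illustrative and was not run as part of this bug):

# Rough sketch: compare each VG's recorded pv_count with the PVs actually visible,
# to spot the "Missing pv ... in vg ..." situation shown in the vdsm log above.
import subprocess

def lvm_report(cmd):
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [line.split() for line in out.splitlines() if line.strip()]

vgs = lvm_report(["vgs", "--noheadings", "-o", "vg_name,pv_count"])
pvs = lvm_report(["pvs", "--noheadings", "-o", "vg_name,pv_name"])

visible = {}
for entry in pvs:
    if len(entry) == 2:                  # skip PVs that are not part of any VG
        visible.setdefault(entry[0], []).append(entry[1])

for vg_name, pv_count in vgs:
    seen = visible.get(vg_name, [])
    if len(seen) != int(pv_count):
        print(f"VG {vg_name}: metadata expects {pv_count} PVs, {len(seen)} visible: {seen}")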

Comment 11 Allon Mureinik 2015-04-20 17:52:40 UTC
(In reply to Kevin Alon Goldblatt from comment #6)
> 4. 2 of the disk snapshots are displayed as Illegal since the failed Live
> Delete Snapshot operation. Subsequently I can no longer delete this snapshot

Just to clarify - it's just the snapshot that's marked illegal?
The Disk object is OK?
Can you stop the VM and re-run it?

Comment 12 Kevin Alon Goldblatt 2015-04-21 07:32:33 UTC
(In reply to Allon Mureinik from comment #11)
> (In reply to Kevin Alon Goldblatt from comment #6)
> > 4. 2 of the disk snapshots are displayed as Illegal since the failed Live
> > Delete Snapshot operation. Subsequently I can no longer delete this snapshot
> 
> Just to clarify - it's just the snapshot that's marked illegal?
> The Disk object is OK?
> Can you stop the VM and re-run it?

Yes, I can stop and re-run the VM, but any attempt to delete the same snapshot again fails.

Comment 13 Kevin Alon Goldblatt 2015-04-21 07:35:37 UTC
(In reply to Adam Litke from comment #10)
> The included vdsm log does not show the actual merge operation so it's hard
> to tell if there were any specific errors with that flow.  What I see here
> are some errors from LVM indicating missing PVs.  It seems like there could
> have been some external corruption of the storage domain (perhaps someone
> removed some LUNs on the server side?

No LUNs are missing on the storage.

Comment 14 Kevin Alon Goldblatt 2015-04-21 07:46:22 UTC
Created attachment 1016713 [details]
engine, server, vdsm and libvirt logs

Added an additional vdsm log that covers the earlier period, from before the merge:
vdsm.log.22.xz

Comment 15 Kevin Alon Goldblatt 2015-04-21 08:57:57 UTC
Upon further investigation I found that the illegal disk is displayed under Storage > Disks > Disk Snapshots but not under VM > Snapshots > Disks. This means that the database has 2 tables that are not synchronized.

Comment 16 Allon Mureinik 2015-04-21 09:00:24 UTC
(In reply to Kevin Alon Goldblatt from comment #15)
> Upon further investigation I found that the illegal disk is displayed in the
> Storage > Disks > Disk snapshots but not in the VM > Snapshots > Disks. This
> means that the database has 2 tables that are not synchronized
No, it doesn't.
It means the disk as a whole is OK (i.e., can be used for VM operations like running it), but there's a problem somewhere down the chain which prevents things like merging and requires some intervention.
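
To illustrate the distinction (purely schematic, not engine code): the disk stays usable because the VM only needs a healthy active layer, while snapshot-level operations look at the whole volume chain, which is where the ILLEGAL flag sits.

# Schematic only: disk-level status vs. snapshot/volume-level status.
from dataclasses import dataclass

@dataclass
class Volume:
    vol_id: str
    legality: str          # "LEGAL" or "ILLEGAL", as reported by vdsm setLegality

@dataclass
class Disk:
    chain: list            # base volume -> ... -> active layer

    def runnable(self):
        # Running the VM only needs the active layer, so the Disk object shows as OK.
        return self.chain[-1].legality == "LEGAL"

    def chain_clean(self):
        # Snapshot deletion walks the chain; the interior ILLEGAL volume (the one visible
        # under Storage > Disk Snapshots) is what the retried merge is supposed to clean up.
        return all(v.legality == "LEGAL" for v in self.chain)

disk = Disk([Volume("base", "LEGAL"), Volume("snsa2", "ILLEGAL"), Volume("active", "LEGAL")])
print(disk.runnable(), disk.chain_clean())   # True False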

Comment 17 Greg Padgett 2015-04-21 16:55:42 UTC
(In reply to Kevin Alon Goldblatt from comment #15)
> Upon further investigation I found that the illegal disk is displayed in the
> Storage > Disks > Disk snapshots but not in the VM > Snapshots > Disks.
<snip>

Okay, I think I see what's going on.  The vdsm restart happened during the destroy_image phase of the merge attempts, and thus the merges were successful but the snapshot volumes are still on the storage.  We see they're illegal, as they should be; however, the flow is designed so that you can retry the merge on the illegal volumes in order to clean them up.  This last bit isn't working for some reason or another.  I'm looking into it.
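
Schematically, the recovery path is meant to work something like the following (illustrative pseudologic only, not engine code; every name is made up). The point is that the ILLEGAL flag records that the data was already merged, so a retry can skip straight to the volume cleanup that the restart interrupted:

# Illustrative pseudologic of the intended retry behaviour (NOT engine code).
def live_merge(vol):
    print(f"merge {vol['id']} into its child (block commit)")

def mark_illegal(vol):
    vol["legality"] = "ILLEGAL"        # records that the data now lives in the child volume

def destroy_image(vol):
    print(f"remove orphaned volume {vol['id']} from storage")

def remove_snapshot_disk(vol):
    if vol["legality"] != "ILLEGAL":
        live_merge(vol)                # phase 1: merge the snapshot's data
        mark_illegal(vol)              # phase 2: flag the volume as redundant
    # phase 3: the destroy_image step that the vdsm restart interrupted in this bug;
    # on retry the ILLEGAL flag lets the flow resume here instead of merging again.
    destroy_image(vol)

vol = {"id": "dc6ac4be-a71d-4822-bd79-8522abdee74b", "legality": "ILLEGAL"}
remove_snapshot_disk(vol)              # a retry goes straight to cleanup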

Comment 18 Greg Padgett 2015-04-24 21:05:34 UTC
After further testing, it appears the live merge could not be retried because the command would not converge after failure to delete the image. [1]  With the linked patch, the failure causes overall command failure, leaving the snapshot and image unlocked and in the proper state to allow a retry.

There may still be a failure due to the timing of the restart, but note that live merge isn't designed to necessarily succeed if restarted at any point; rather, it is designed to recover gracefully if re-executed after a failure such as this.

Please test with this patch once it's in the build and let me know if you still see any problems.

[1] In this case, DestroyImageCommand failed, which is the third part (of 3) of the live merge flow on the engine.  This child command is an old-style command (runs an SPM task) whereas its parent which controls the flow uses newer infrastructure (CommandCoordinator).  The two don't always play well together, and we had already added a special case to get the endAction flow working between the two.  This was extended further to avoid an endless loop of status-checking if the child command fails.
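
A stylized view of the convergence issue in [1] (again illustrative only, not the engine's actual polling code): before the patch, a failed child could leave the parent checking the child status indefinitely; with the patch, a failed child ends the parent command as well, leaving the snapshot and image unlocked so the merge can be retried.

# Stylized sketch of parent/child status convergence (names and statuses are illustrative).
def poll_until_converged(get_child_status, on_unlock):
    while True:
        status = get_child_status()
        if status == "SUCCEEDED":
            return "SUCCEEDED"
        if status == "FAILED":         # the case the fix adds: propagate the failure...
            on_unlock()                # ...and release the snapshot/image locks for a retry
            return "FAILED"
        # any other status: child still running (pre-fix, a failed SPM-task child could
        # leave the parent spinning here forever, as seen with step DESTROY_IMAGE above)

statuses = iter(["ACTIVE", "ACTIVE", "FAILED"])
print(poll_until_converged(lambda: next(statuses),
                           lambda: print("snapshot and image unlocked")))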

Comment 19 Greg Padgett 2015-04-25 03:56:14 UTC
Found some issues later in the retry flow, moving back to assigned.

Comment 20 Greg Padgett 2015-04-30 20:10:06 UTC
A partial fix is posted to gerrit.  Looks like there's still a case where the engine can "forget" that it performed a merge, which will cause an error when it retries during command recovery.

Comment 21 Allon Mureinik 2015-05-03 09:37:58 UTC
(In reply to Greg Padgett from comment #20)
> A partial fix is posted to gerrit.  Looks like there's still a case where
> the engine can "forget" that it performed a merge, which will cause an error
> when it retries during command recovery.
So let's clone the bug for better tracking.
We'll have this bug for 3.5.4 to track the fix we DO have, and use the other one to track the remaining problem.

Comment 23 Greg Padgett 2015-06-17 19:05:37 UTC
(In reply to Allon Mureinik from comment #21)
> (In reply to Greg Padgett from comment #20)
> > A partial fix is posted to gerrit.  Looks like there's still a case where
> > the engine can "forget" that it performed a merge, which will cause an error
> > when it retries during command recovery.
> So let's clone the bug for better tracking.
> We'll have this bug for 3.5.4 to track the fix we DO have, and use the other
> one to track the remaining problem.

Opened bug 1227497 to track that issue.

Comment 24 Kevin Alon Goldblatt 2015-07-07 10:15:45 UTC
Verified using version (rpm -qa):
--------------------------------------------------------
vdsm-4.17.0-1054.git562e711.el7.noarch
ovirt-engine-3.6.0-0.0.master.20150627185750.git6f063c1.el6.noarch


Verified with the following scenario:
-------------------------------------------------------
Steps to Reproduce:
1. Created a VM with several disks, including block preallocated, block thin, NFS preallocated, and NFS thin
2. Started the VM
3. Created 3 snapshots: snsa1, snsa2, snsa3
4. Deleted snapshot snsa2 and, while the snapshot was locked, restarted vdsm >>>>> The snapshot delete operation was SUCCESSFUL
5. Deleted snapshot snsa1 and, while the snapshot was locked, restarted vdsm >>>>> The snapshot delete operation was SUCCESSFUL

Verified using a nested host!

Moving to VERIFIED!

Comment 25 dev-unix-virtualization 2015-11-18 21:04:59 UTC
Hi, 

 Is it possible to port this fix back into RHEV 3.5, please? This is the version we've integrated against, and we would really like to see a fix for that version.

-Amit Mitkar

Comment 26 Yaniv Lavi 2015-11-22 12:45:53 UTC
(In reply to dev-unix-virtualization from comment #25)
> Hi, 
> 
> Is it possible to port this fix back into RHEV 3.5, please? This is the
> version we've integrated against, and we would really like to see a fix for
> that version.
> 
> -Amit Mitkar

It was included in 3.5.4.

Comment 27 Allon Mureinik 2016-03-10 10:41:33 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE

Comment 28 Allon Mureinik 2016-03-10 10:46:41 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE

Comment 29 Allon Mureinik 2016-03-10 12:03:14 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE