Created attachment 757254 [details]
logs

Description of problem:
I ran a VM on the HSM host and, once the VM was up, rebooted that host. While the host was down I started an LSM for the VM's disk (the VM was still in status UP). Although the host the VM is running on is down, the engine sends the createVolume task to VDSM, which succeeds (even though LSM requires prepareImagePath, which cannot run since the VM is down), and we then continue to the cloneImageStructure step. The LSM fails with java.lang.ArrayIndexOutOfBoundsException: -1.

Version-Release number of selected component (if applicable):
sf14

How reproducible:
100%

Steps to Reproduce:
1. On iSCSI storage with 2 hosts, create and run a VM with one disk.
2. Run the VM on the HSM host.
3. When the VM changes status to UP, reboot the HSM host.
4. Once the host is down, LSM the VM's disk.

Actual results:
Even though the host is down, the engine sends createVolume and then continues to cloneImageStructure; the LSM fails with java.lang.ArrayIndexOutOfBoundsException: -1.

Expected results:
If the VM is running on a host which is having issues, we should not send the LSM command.

Additional info:
2013-06-05 17:27:39,508 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-71) Command GetCapabilitiesVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
2013-06-05 17:27:51,268 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateVmDisksCommand] (pool-4-thread-50) [a6146fd] Running command: LiveMigrateVmDisksCommand Task handler: LiveSnapshotTaskHandler internal: false.
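The expected behavior above amounts to a pre-validation guard before the LSM flow starts. A minimal sketch of such a check, with hypothetical names (this is an illustration, not the actual engine validation code):

```python
from collections import namedtuple

# Hypothetical minimal models of the entities the check needs.
VM = namedtuple("VM", "status")
Host = namedtuple("Host", "status")

def can_start_lsm(vm, host):
    """Return (ok, reason): refuse to start live storage migration
    unless both the VM and the host it runs on are confirmed Up."""
    if vm.status != "Up":
        return False, "VM is not running"
    if host is None or host.status != "Up":
        return False, "host running the VM is not responsive"
    return True, "ok"
```

In the reproduction scenario the host is mid-reboot, so a guard like this would reject the LSM request up front instead of sending createVolume to VDSM.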
Entities affected : ID: 6c2d6773-1b7d-4035-be96-406973b72e8b Type: Disk, ID: 38755249-4bb3-4841-bf5b-05f4a521514d Type: Storage
2013-06-05 17:27:51,296 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (pool-4-thread-50) [a6146fd] START, GetImageInfoVDSCommand( storagePoolId = 7fd33b43-a9f4-4eb7-a885-e9583a929ceb, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = 81ef11d0-4c0c-47b4-8953-d61a6af442d8, imageGroupId = 6088697e-814c-490c-8623-15c3c254b8ff, imageId = 6c2d6773-1b7d-4035-be96-406973b72e8b), log id: 20affcf5
2013-06-05 17:27:52,187 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-4-thread-50) [a6146fd] -- createVolume parameters: sdUUID=81ef11d0-4c0c-47b4-8953-d61a6af442d8 spUUID=7fd33b43-a9f4-4eb7-a885-e9583a929ceb imgGUID=6088697e-814c-490c-8623-15c3c254b8ff size=3,221,225,472 bytes volFormat=COW volType=Sparse volUUID=dce18c12-b057-4eb1-80a7-b134e8ffc344 descr= srcImgGUID=6088697e-814c-490c-8623-15c3c254b8ff srcVolUUID=6c2d6773-1b7d-4035-be96-406973b72e8b
2013-06-05 17:27:53,078 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-4-thread-50) [a6146fd] FINISH, CreateSnapshotVDSCommand, return: dce18c12-b057-4eb1-80a7-b134e8ffc344, log id: 3ef40849

aa26304d-98a2-44f7-8c0c-e4aeaa6cc30c : verb = createVolume id = aa26304d-98a2-44f7-8c0c-e4aeaa6cc30c
34bc661b-905f-487e-81bc-758361fefa72 : verb = cloneImageStructure id = 34bc661b-905f-487e-81bc-758361fefa72
This is by definition a race condition: even if we succeed with createVolume, check the domain's status, and then send cloneImageGroupStructure, the domain can still fail (e.g., lose connectivity) a nanosecond later, before cloneImageGroupStructure reaches VDSM. More broadly, the same can be said of any engine operation that does not consist of a single, atomic VDSM verb.

Since the ArrayIndexOutOfBounds issue in the recovery has been fixed (see external tracker for details), I propose closing this BZ.

Sean, since you gave pm_ack+, I'd like your input on this, please - what am I missing?
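The check-then-act race described above can be illustrated in a toy sketch (hypothetical names, not engine code): a status pre-check passes, the first verb succeeds, and the host fails in the window before the second verb, so each step still needs its own error handling and recovery regardless of any up-front validation.

```python
class Host:
    """Toy host whose availability can change at any moment."""
    def __init__(self):
        self.up = True

def send_verb(host, verb, log):
    """Simulated VDSM call: fails if the host is down."""
    if not host.up:
        raise ConnectionError("Connection refused: " + verb)
    log.append(verb)

def lsm_flow(host, log):
    """Check-then-act over two VDSM verbs.  The pre-check passes
    (time of check), but the host dies between the verbs, so the
    second call fails (time of use)."""
    if not host.up:
        raise RuntimeError("pre-check failed: host down")
    send_verb(host, "createVolume", log)         # succeeds
    host.up = False                              # host reboots mid-flow
    send_verb(host, "cloneImageStructure", log)  # now fails
```

Here the simulated reboot is placed deterministically between the two verbs to make the race window visible; in reality it can open at any point between any two non-atomic steps.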
> Since the ArrayIndexOutOfBounds issue in the recovery has been fixed (see
> external tracker for details), I propose closing this BZ.

Please add the external tracker & close.
After a connectivity failure with the host, LSM starts:

2013-07-22 11:41:31,360 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateVmDisksCommand] (pool-5-thread-47) [306b136f] Running command: LiveMigrateVmDisksCommand Task handler: LiveSnapshotTaskHandler internal: false. Entities affected : ID: 9ae3b1fc-a073-4e39-b79a-1daedcc7e73d Type: Disk, ID: 875c4059-7557-4fbf-8bd1-f71dc303f36f Type: Storage
2013-07-22 11:41:31,514 INFO [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (pool-5-thread-47) Running command: CreateAllSnapshotsFromVmCommand internal: true. Entities affected : ID: c8eb585f-fa95-4e5f-9317-49c7bfae9744 Type: VM
2013-07-22 11:41:31,532 INFO [org.ovirt.engine.core.bll.CreateSnapshotCommand] (pool-5-thread-47) Running command: CreateSnapshotCommand internal: true. Entities affected : ID: 00000000-0000-0000-0000-000000000000 Type: Storage
2013-07-22 11:41:31,535 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-5-thread-47) START, CreateSnapshotVDSCommand( storagePoolId = 0793e09a-4d30-4855-a4c6-c1e7cc3eae3f, ignoreFailoverLimit = false, storageDomainId = 9c11959c-cba7-4be7-b0a2-9150f6e9789c, imageGroupId = 50f0dd2e-2897-4d24-b436-e17fdf6b70e7, imageSizeInBytes = 2147483648, volumeFormat = COW, newImageId = 12fc0ae1-52b9-447e-a4b2-a5ab1b419330, newImageDescription = , imageId = 9ae3b1fc-a073-4e39-b79a-1daedcc7e73d, sourceImageGroupId = 50f0dd2e-2897-4d24-b436-e17fdf6b70e7), log id: 60fc88e2
2013-07-22 11:41:31,535 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-5-thread-47) -- CreateSnapshotVDSCommand::ExecuteIrsBrokerCommand: calling 'createVolume' with two new parameters: description and UUID
2013-07-22 11:41:31,535 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-5-thread-47) -- createVolume parameters: sdUUID=9c11959c-cba7-4be7-b0a2-9150f6e9789c spUUID=0793e09a-4d30-4855-a4c6-c1e7cc3eae3f imgGUID=50f0dd2e-2897-4d24-b436-e17fdf6b70e7 size=2,147,483,648 bytes volFormat=COW volType=Sparse volUUID=12fc0ae1-52b9-447e-a4b2-a5ab1b419330 descr= srcImgGUID=50f0dd2e-2897-4d24-b436-e17fdf6b70e7 srcVolUUID=9ae3b1fc-a073-4e39-b79a-1daedcc7e73d

We get a host connectivity ERROR:

2013-07-22 11:41:43,700 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-47) Command SnapshotVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
2013-07-22 11:41:43,700 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-47) FINISH, SnapshotVDSCommand, log id: cc4d635
2013-07-22 11:41:43,701 WARN [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (pool-5-thread-47) Wasnt able to live snpashot due to error: VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused (Failed with VDSM error VDS_NETWORK_ERROR and code 5022). VM will still be configured to the new created snapshot
2013-07-22 11:41:43,709 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-5-thread-47) Correlation ID: null, Call Stack: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused (Failed with VDSM error VDS_NETWORK_ERROR and code 5022)

The engine then calls to end and clear the LSM tasks:

2013-07-22 11:41:53,747 INFO [org.ovirt.engine.core.bll.CommandAsyncTask] (pool-5-thread-47) CommandAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk completed, handling the result.
2013-07-22 11:41:53,748 INFO [org.ovirt.engine.core.bll.CommandAsyncTask] (pool-5-thread-47) CommandAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk succeeded, clearing tasks.

Verified on RHEVM 3.3 - IS5: rhevm-3.3.0-0.9.master.el6ev.noarch
Closing - RHEV 3.3 Released