Bug 971066

Summary: engine: engine sends createVolume and continues to cloneImageStructure steps even though the host that the vm is suppose to be running on has rebooted
Product: Red Hat Enterprise Virtualization Manager Reporter: Dafna Ron <dron>
Component: ovirt-engineAssignee: Ayal Baron <abaron>
Status: CLOSED CURRENTRELEASE QA Contact: Elad <ebenahar>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.2.0CC: acanan, acathrow, amureini, iheim, jkt, lpeer, Rhev-m-bugs, scohen, yeylon
Target Milestone: ---Flags: scohen: needinfo+
Target Release: 3.3.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: is2 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-01-21 17:19:18 EST Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Description Flags
logs none

Description Dafna Ron 2013-06-05 11:19:01 EDT
Created attachment 757254 [details]

Description of problem:

I ran a vm on the hsm and once the vm was up I rebooted the host. 
when the host was down I started a LSM for the vm disk (the vm was in status UP)
although the host that the vm is running on is down engine sends the create volume task to the vdsm, which succeeds (although LSM requires prepareImagePath which cannot be run since the vm is down) and than we continue to the cloneImageStructure step. 
we fail the LSM with java.lang.ArrayIndexOutOfBoundsException: -1

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. in iscsi storage with 2 hosts, create and run a vm with one disk
2. run the vm on the hsm
3. when the vm changes status to up reboot the hsm host
4. once the host is down LSM the vm disk

Actual results:

even though the host is down engine sends the createVolume and than to cloneImageStructure we fail on LSM with the LSM on java.lang.ArrayIndexOutOfBoundsException: -1

Expected results:

if the vm is running on a host which is having issues we should not send LSM command.

Additional info:

2013-06-05 17:27:39,508 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-71) Command GetCapabilitiesVDS execution failed. Exception: VDSNetworkExc
eption: java.net.ConnectException: Connection refused

2013-06-05 17:27:51,268 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateVmDisksCommand] (pool-4-thread-50) [a6146fd] Running command: LiveMigrateVmDisksCommandTask handler: Liv
eSnapshotTaskHandler internal: false. Entities affected :  ID: 6c2d6773-1b7d-4035-be96-406973b72e8b Type: Disk,  ID: 38755249-4bb3-4841-bf5b-05f4a521514d Type: Storage
2013-06-05 17:27:51,296 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (pool-4-thread-50) [a6146fd] START, GetImageInfoVDSCommand( storagePoolId = 7fd
33b43-a9f4-4eb7-a885-e9583a929ceb, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = 81ef11d0-4c0c-47b4-8953-d61a6af442d8, imageGroupId = 6088697e-814
c-490c-8623-15c3c254b8ff, imageId = 6c2d6773-1b7d-4035-be96-406973b72e8b), log id: 20affcf5

2013-06-05 17:27:52,187 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-4-thread-50) [a6146fd] -- createVolume parameters:
                size=3,221,225,472 bytes
2013-06-05 17:27:53,078 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-4-thread-50) [a6146fd] FINISH, CreateSnapshotVDSCommand, return: dce18c12-b057-4eb1-80a7-b134e8ffc344, log id: 3ef40849

aa26304d-98a2-44f7-8c0c-e4aeaa6cc30c :
        verb = createVolume
        id = aa26304d-98a2-44f7-8c0c-e4aeaa6cc30c

34bc661b-905f-487e-81bc-758361fefa72 :
        verb = cloneImageStructure
        id = 34bc661b-905f-487e-81bc-758361fefa72
Comment 1 Allon Mureinik 2013-07-09 11:31:11 EDT
This is by definition a race condition - even if we succeed with createVolume, check the domain's status and send cloneImageGroupStructure the domain can fail (e.g., lose connectivity) a nanosecond later, before cloneImageGroupStructure reaches VDSM.

In a broader perspective, the same can be said for any engine operation which does not consist of a single, atomic, VDSM verb.

Since the ArrayIndexOutOfBounds issue in the recovery has been fixed (see external tracker for details), I propose closing this BZ.

Sean, since you gave pm_ack+, I'd like your input on this please - what am I missing?
Comment 2 Sean Cohen 2013-07-11 03:28:39 EDT
> Since the ArrayIndexOutOfBounds issue in the recovery has been fixed (see
> external tracker for details), I propose closing this BZ.

Please add external tracker & close.
Comment 3 Elad 2013-07-22 05:02:21 EDT
After a connectivity failure with the host, LSM starts:

2013-07-22 11:41:31,360 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateVmDisksCommand] (pool-5-thread-47) [306b136f] Running command: LiveMigrateVmDisksCommand Task handler: LiveSnapshotTaskHandler internal: fals
e. Entities affected :  ID: 9ae3b1fc-a073-4e39-b79a-1daedcc7e73d Type: Disk,  ID: 875c4059-7557-4fbf-8bd1-f71dc303f36f Type: Storage
2013-07-22 11:41:31,514 INFO  [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (pool-5-thread-47) Running command: CreateAllSnapshotsFromVmCommand internal: true. Entities affected :  ID: c8eb585f-fa95-
4e5f-9317-49c7bfae9744 Type: VM
2013-07-22 11:41:31,532 INFO  [org.ovirt.engine.core.bll.CreateSnapshotCommand] (pool-5-thread-47) Running command: CreateSnapshotCommand internal: true. Entities affected :  ID: 00000000-0000-0000-0000-0000000000
00 Type: Storage
2013-07-22 11:41:31,535 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-5-thread-47) START, CreateSnapshotVDSCommand( storagePoolId = 0793e09a-4d30-4855-a4c6-c1e7cc3eae3f, ignoreFa
iloverLimit = false, storageDomainId = 9c11959c-cba7-4be7-b0a2-9150f6e9789c, imageGroupId = 50f0dd2e-2897-4d24-b436-e17fdf6b70e7, imageSizeInBytes = 2147483648, volumeFormat = COW, newImageId = 12fc0ae1-52b9-447e-
a4b2-a5ab1b419330, newImageDescription = , imageId = 9ae3b1fc-a073-4e39-b79a-1daedcc7e73d, sourceImageGroupId = 50f0dd2e-2897-4d24-b436-e17fdf6b70e7), log id: 60fc88e2
2013-07-22 11:41:31,535 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-5-thread-47) -- CreateSnapshotVDSCommand::ExecuteIrsBrokerCommand: calling 'createVolume' with two new param
eters: description and UUID
2013-07-22 11:41:31,535 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-5-thread-47) -- createVolume parameters:
                size=2,147,483,648 bytes

We get host connectivity ERROR:

2013-07-22 11:41:43,700 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-47) Command SnapshotVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
2013-07-22 11:41:43,700 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-47) FINISH, SnapshotVDSCommand, log id: cc4d635
2013-07-22 11:41:43,701 WARN  [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (pool-5-thread-47) Wasnt able to live snpashot due to error: VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused (Failed with VDSM error VDS_NETWORK_ERROR and code 5022). VM will still be configured to the new created snapshot
2013-07-22 11:41:43,709 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-5-thread-47) Correlation ID: null, Call Stack: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused (Failed with VDSM error VDS_NETWORK_ERROR and code 5022)

Engine calls to end and clear LSM tasks:

2013-07-22 11:41:53,747 INFO  [org.ovirt.engine.core.bll.CommandAsyncTask] (pool-5-thread-47) CommandAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk completed, handling the result.
2013-07-22 11:41:53,748 INFO  [org.ovirt.engine.core.bll.CommandAsyncTask] (pool-5-thread-47) CommandAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk succeeded, clearing tasks.

Verified on RHEVM3.3 - IS5:
Comment 4 Itamar Heim 2014-01-21 17:19:18 EST
Closing - RHEV 3.3 Released
Comment 5 Itamar Heim 2014-01-21 17:25:30 EST
Closing - RHEV 3.3 Released