Bug 971066 - engine: engine sends createVolume and continues to cloneImageStructure steps even though the host that the vm is suppose to be running on has rebooted
engine: engine sends createVolume and continues to cloneImageStructure steps ...
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine (Show other bugs)
3.2.0
x86_64 Linux
unspecified Severity high
: ---
: 3.3.0
Assigned To: Ayal Baron
Elad
storage
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2013-06-05 11:19 EDT by Dafna Ron
Modified: 2016-02-10 11:48 EST (History)
9 users (show)

See Also:
Fixed In Version: is2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 17:19:18 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
scohen: needinfo+


Attachments (Terms of Use)
logs (4.42 MB, application/x-gzip)
2013-06-05 11:19 EDT, Dafna Ron
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 15133 None None None Never

  None (edit)
Description Dafna Ron 2013-06-05 11:19:01 EDT
Created attachment 757254 [details]
logs

Description of problem:

I ran a vm on the hsm and once the vm was up I rebooted the host. 
when the host was down I started a LSM for the vm disk (the vm was in status UP)
although the host that the vm is running on is down engine sends the create volume task to the vdsm, which succeeds (although LSM requires prepareImagePath which cannot be run since the vm is down) and than we continue to the cloneImageStructure step. 
we fail the LSM with java.lang.ArrayIndexOutOfBoundsException: -1

Version-Release number of selected component (if applicable):

sf14

How reproducible:

100%

Steps to Reproduce:
1. in iscsi storage with 2 hosts, create and run a vm with one disk
2. run the vm on the hsm
3. when the vm changes status to up reboot the hsm host
4. once the host is down LSM the vm disk

Actual results:

even though the host is down engine sends the createVolume and than to cloneImageStructure we fail on LSM with the LSM on java.lang.ArrayIndexOutOfBoundsException: -1

Expected results:

if the vm is running on a host which is having issues we should not send LSM command.

Additional info:

2013-06-05 17:27:39,508 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-71) Command GetCapabilitiesVDS execution failed. Exception: VDSNetworkExc
eption: java.net.ConnectException: Connection refused

2013-06-05 17:27:51,268 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateVmDisksCommand] (pool-4-thread-50) [a6146fd] Running command: LiveMigrateVmDisksCommandTask handler: Liv
eSnapshotTaskHandler internal: false. Entities affected :  ID: 6c2d6773-1b7d-4035-be96-406973b72e8b Type: Disk,  ID: 38755249-4bb3-4841-bf5b-05f4a521514d Type: Storage
2013-06-05 17:27:51,296 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (pool-4-thread-50) [a6146fd] START, GetImageInfoVDSCommand( storagePoolId = 7fd
33b43-a9f4-4eb7-a885-e9583a929ceb, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = 81ef11d0-4c0c-47b4-8953-d61a6af442d8, imageGroupId = 6088697e-814
c-490c-8623-15c3c254b8ff, imageId = 6c2d6773-1b7d-4035-be96-406973b72e8b), log id: 20affcf5

2013-06-05 17:27:52,187 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-4-thread-50) [a6146fd] -- createVolume parameters:
                sdUUID=81ef11d0-4c0c-47b4-8953-d61a6af442d8
                spUUID=7fd33b43-a9f4-4eb7-a885-e9583a929ceb
                imgGUID=6088697e-814c-490c-8623-15c3c254b8ff
                size=3,221,225,472 bytes
                volFormat=COW
                volType=Sparse
                volUUID=dce18c12-b057-4eb1-80a7-b134e8ffc344
                descr=
                srcImgGUID=6088697e-814c-490c-8623-15c3c254b8ff
                srcVolUUID=6c2d6773-1b7d-4035-be96-406973b72e8b
                
2013-06-05 17:27:53,078 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-4-thread-50) [a6146fd] FINISH, CreateSnapshotVDSCommand, return: dce18c12-b057-4eb1-80a7-b134e8ffc344, log id: 3ef40849

aa26304d-98a2-44f7-8c0c-e4aeaa6cc30c :
        verb = createVolume
        id = aa26304d-98a2-44f7-8c0c-e4aeaa6cc30c

34bc661b-905f-487e-81bc-758361fefa72 :
        verb = cloneImageStructure
        id = 34bc661b-905f-487e-81bc-758361fefa72
Comment 1 Allon Mureinik 2013-07-09 11:31:11 EDT
This is by definition a race condition - even if we succeed with createVolume, check the domain's status and send cloneImageGroupStructure the domain can fail (e.g., lose connectivity) a nanosecond later, before cloneImageGroupStructure reaches VDSM.

In a broader perspective, the same can be said for any engine operation which does not consist of a single, atomic, VDSM verb.

Since the ArrayIndexOutOfBounds issue in the recovery has been fixed (see external tracker for details), I propose closing this BZ.

Sean, since you gave pm_ack+, I'd like your input on this please - what am I missing?
Comment 2 Sean Cohen 2013-07-11 03:28:39 EDT
> Since the ArrayIndexOutOfBounds issue in the recovery has been fixed (see
> external tracker for details), I propose closing this BZ.

Please add external tracker & close.
Comment 3 Elad 2013-07-22 05:02:21 EDT
After a connectivity failure with the host, LSM starts:

2013-07-22 11:41:31,360 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateVmDisksCommand] (pool-5-thread-47) [306b136f] Running command: LiveMigrateVmDisksCommand Task handler: LiveSnapshotTaskHandler internal: fals
e. Entities affected :  ID: 9ae3b1fc-a073-4e39-b79a-1daedcc7e73d Type: Disk,  ID: 875c4059-7557-4fbf-8bd1-f71dc303f36f Type: Storage
2013-07-22 11:41:31,514 INFO  [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (pool-5-thread-47) Running command: CreateAllSnapshotsFromVmCommand internal: true. Entities affected :  ID: c8eb585f-fa95-
4e5f-9317-49c7bfae9744 Type: VM
2013-07-22 11:41:31,532 INFO  [org.ovirt.engine.core.bll.CreateSnapshotCommand] (pool-5-thread-47) Running command: CreateSnapshotCommand internal: true. Entities affected :  ID: 00000000-0000-0000-0000-0000000000
00 Type: Storage
2013-07-22 11:41:31,535 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-5-thread-47) START, CreateSnapshotVDSCommand( storagePoolId = 0793e09a-4d30-4855-a4c6-c1e7cc3eae3f, ignoreFa
iloverLimit = false, storageDomainId = 9c11959c-cba7-4be7-b0a2-9150f6e9789c, imageGroupId = 50f0dd2e-2897-4d24-b436-e17fdf6b70e7, imageSizeInBytes = 2147483648, volumeFormat = COW, newImageId = 12fc0ae1-52b9-447e-
a4b2-a5ab1b419330, newImageDescription = , imageId = 9ae3b1fc-a073-4e39-b79a-1daedcc7e73d, sourceImageGroupId = 50f0dd2e-2897-4d24-b436-e17fdf6b70e7), log id: 60fc88e2
2013-07-22 11:41:31,535 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-5-thread-47) -- CreateSnapshotVDSCommand::ExecuteIrsBrokerCommand: calling 'createVolume' with two new param
eters: description and UUID
2013-07-22 11:41:31,535 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (pool-5-thread-47) -- createVolume parameters:
                sdUUID=9c11959c-cba7-4be7-b0a2-9150f6e9789c
                spUUID=0793e09a-4d30-4855-a4c6-c1e7cc3eae3f
                imgGUID=50f0dd2e-2897-4d24-b436-e17fdf6b70e7
                size=2,147,483,648 bytes
                volFormat=COW
                volType=Sparse
                volUUID=12fc0ae1-52b9-447e-a4b2-a5ab1b419330
                descr=
                srcImgGUID=50f0dd2e-2897-4d24-b436-e17fdf6b70e7
                srcVolUUID=9ae3b1fc-a073-4e39-b79a-1daedcc7e73d

We get host connectivity ERROR:

2013-07-22 11:41:43,700 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-47) Command SnapshotVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
2013-07-22 11:41:43,700 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-47) FINISH, SnapshotVDSCommand, log id: cc4d635
2013-07-22 11:41:43,701 WARN  [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (pool-5-thread-47) Wasnt able to live snpashot due to error: VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused (Failed with VDSM error VDS_NETWORK_ERROR and code 5022). VM will still be configured to the new created snapshot
2013-07-22 11:41:43,709 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-5-thread-47) Correlation ID: null, Call Stack: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused (Failed with VDSM error VDS_NETWORK_ERROR and code 5022)



Engine calls to end and clear LSM tasks:

2013-07-22 11:41:53,747 INFO  [org.ovirt.engine.core.bll.CommandAsyncTask] (pool-5-thread-47) CommandAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk completed, handling the result.
2013-07-22 11:41:53,748 INFO  [org.ovirt.engine.core.bll.CommandAsyncTask] (pool-5-thread-47) CommandAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk succeeded, clearing tasks.



Verified on RHEVM3.3 - IS5:
rhevm-3.3.0-0.9.master.el6ev.noarch
Comment 4 Itamar Heim 2014-01-21 17:19:18 EST
Closing - RHEV 3.3 Released
Comment 5 Itamar Heim 2014-01-21 17:25:30 EST
Closing - RHEV 3.3 Released

Note You need to log in before you can comment on or make changes to this bug.