Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1580437

Summary: Live disk migration for the 2nd time after a failed attempt failed with error replicaErr and code 55; RCA shows the srcDisk and dstDisk dictionaries are the same
Product: [oVirt] ovirt-engine
Reporter: Avihai <aefrat>
Component: BLL.Storage
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED CURRENTRELEASE
QA Contact: Yosi Ben Shimon <ybenshim>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.2.3.5
CC: aefrat, bugs, ebenahar, tnisan
Target Milestone: ovirt-4.2.5
Keywords: Automation
Target Release: ---
Flags: rule-engine: ovirt-4.2+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-engine-4.2.4.5
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-31 15:25:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1590292
Bug Blocks:
Attachments: engine, vdsm logs (flags: none)

Description Avihai 2018-05-21 13:12:28 UTC
Created attachment 1439614 [details]
engine , vdsm logs

Description of problem:
The 1st live migration of the disk failed as expected because the SPM's network connection to the storage domain was blocked.
The 2nd live migration started with a new SPM and no network blockage, but it failed with a drive replication error. Looking deeper into the VDSM log shows that the srcDisk and dstDisk dictionaries are the same, so VDSM, for an unknown reason, tries to move the disk to the same storage domain and fails.

For some reason the srcDisk and dstDisk dicts are the same:
srcDisk={'device': 'disk', 'poolID': '8435060f-b44c-4442-946a-61a8bd8bebfb', 'volumeID': 'abf2d3c5-19ed-461c-8a75-35075e249fec', 'domainID': '14725dff-0afe-48ca-a5da-eef2e6100581', 'imageID': '4ed56955-f336-4321-9e0e-a5b786035c4d'},

dstDisk={'device': 'disk', 'poolID': '8435060f-b44c-4442-946a-61a8bd8bebfb', 'volumeID': 'abf2d3c5-19ed-461c-8a75-35075e249fec', 'domainID': '14725dff-0afe-48ca-a5da-eef2e6100581', 'imageID': '4ed56955-f336-4321-9e0e-a5b786035c4d'}

disk name = vm_TestCase6000_2114461073_Disk_0 
disk id = 4ed56955-f336-4321-9e0e-a5b786035c4d
Disk is created on SD iscsi_1 

SD iscsi_0 id = 1021fbcc-4adf-4db6-b64b-b0aa7ae716cb
SD iscsi_1 id = 14725dff-0afe-48ca-a5da-eef2e6100581
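
To make the broken state concrete, here is a small illustrative check (not part of the product or the automation code) over the two dictionaries reported by VDSM; every key, including domainID, is identical, so the "replication" would target the disk's current storage domain (iscsi_1) instead of iscsi_0:

```python
# Illustrative only: the dictionaries below are copied from the VDSM log.
src_disk = {'device': 'disk',
            'poolID': '8435060f-b44c-4442-946a-61a8bd8bebfb',
            'volumeID': 'abf2d3c5-19ed-461c-8a75-35075e249fec',
            'domainID': '14725dff-0afe-48ca-a5da-eef2e6100581',  # iscsi_1
            'imageID': '4ed56955-f336-4321-9e0e-a5b786035c4d'}
dst_disk = dict(src_disk)  # the log shows an identical dictionary

# A sanity check along these lines flags the degenerate call: a live
# storage migration only makes sense when the destination storage
# domain differs from the source one.
if src_disk['domainID'] == dst_disk['domainID']:
    print("degenerate replication: src and dst are the same storage domain")
```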


Engine log:
2018-05-21 14:51:16,946+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [disks_syncAction_378e7e4a-51a9-4f19] START, VmReplicateDiskStartVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{hostId='95d45dc3-2f90-4b0d-bf33-653e5ff940c6', vmId='db21f7c8-f146-4b6e-8fe3-e10584ae08ed', storagePoolId='8435060f-b44c-4442-946a-61a8bd8bebfb', srcStorageDomainId='14725dff-0afe-48ca-a5da-eef2e6100581', targetStorageDomainId='1021fbcc-4adf-4db6-b64b-b0aa7ae716cb', imageGroupId='4ed56955-f336-4321-9e0e-a5b786035c4d', imageId='4cc15d8b-323f-4792-bfbd-45568a2597aa'}), log id: 795f93be
2018-05-21 14:51:17,560+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [disks_syncAction_378e7e4a-51a9-4f19] Failed in 'VmReplicateDiskStartVDS' method
2018-05-21 14:51:17,568+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [disks_syncAction_378e7e4a-51a9-4f19] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command VmReplicateDiskStartVDS failed: Drive replication error
2018-05-21 14:51:17,568+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [disks_syncAction_378e7e4a-51a9-4f19] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand' return value 'StatusOnlyReturn [status=Status [code=55, message=Drive replication error]]'
2018-05-21 14:51:17,568+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [disks_syncAction_378e7e4a-51a9-4f19] HostName = host_mixed_3
2018-05-21 14:51:17,568+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [disks_syncAction_378e7e4a-51a9-4f19] Command 'VmReplicateDiskStartVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{hostId='95d45dc3-2f90-4b0d-bf33-653e5ff940c6', vmId='db21f7c8-f146-4b6e-8fe3-e10584ae08ed', storagePoolId='8435060f-b44c-4442-946a-61a8bd8bebfb', srcStorageDomainId='14725dff-0afe-48ca-a5da-eef2e6100581', targetStorageDomainId='1021fbcc-4adf-4db6-b64b-b0aa7ae716cb', imageGroupId='4ed56955-f336-4321-9e0e-a5b786035c4d', imageId='4cc15d8b-323f-4792-bfbd-45568a2597aa'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskStartVDS, error = Drive replication error, code = 55
2018-05-21 14:51:17,569+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [disks_syncAction_378e7e4a-51a9-4f19] FINISH, VmReplicateDiskStartVDSCommand, log id: 795f93be
2018-05-21 14:51:17,569+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [disks_syncAction_378e7e4a-51a9-4f19] Failed VmReplicateDiskStart (Disk '4ed56955-f336-4321-9e0e-a5b786035c4d' , VM 'db21f7c8-f146-4b6e-8fe3-e10584ae08ed')
2018-05-21 14:51:17,569+03 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [disks_syncAction_378e7e4a-51a9-4f19] Command 'LiveMigrateDisk' id: '2dad16c2-dc3e-4a3b-bd00-7cc86bbb69ec' with children [7b4a42f3-ef2f-460e-98bb-5765377c4825, dcdc830e-f1c8-45f1-8368-ab0f107a397e] failed when attempting to perform the next operation, marking as 'ACTIVE'
2018-05-21 14:51:17,570+03 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [disks_syncAction_378e7e4a-51a9-4f19] EngineException: Drive replication error (Failed with error replicaErr and code 55): org.ovirt.engine.core.common.errors.EngineException: EngineException: Drive replication error (Failed with error replicaErr and code 55)
        at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.replicateDiskStart(LiveMigrateDiskCommand.java:503) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.performNextOperation(LiveMigrateDiskCommand.java:233) [bll.jar:]
        at org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32) [bll.jar:]
        at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:68) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:145) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:106) [bll.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_171]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_171]
        at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_171]
        at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)
2018-05-21 14:51:17,573+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [disks_syncAction_378e7e4a-51a9-4f19] Command 'LiveMigrateDisk' id: '2dad16c2-dc3e-4a3b-bd00-7cc86bbb69ec' child commands '[7b4a42f3-ef2f-460e-98bb-5765377c4825, dcdc830e-f1c8-45f1-8368-ab0f107a397e]' executions were completed, status 'FAILED'
2018-05-21 14:51:18,627+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-9) [disks_syncAction_378e7e4a-51a9-4f19] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.


VDSM (HSM) log - for some reason the srcDisk and dstDisk dictionaries are the same:
2018-05-21 14:50:34,639+0300 INFO  (jsonrpc/4) [api.virt] START diskReplicateFinish(srcDisk={'device': 'disk', 'poolID': '8435060f-b44c-4442-946a-61a8bd8bebfb', 'volumeID': 'abf2d3c5-19ed-461c-8a75-35075e249fec', 'domainID': '14725dff-0afe-48ca-a5da-eef2e6100581', 'imageID': '4ed56955-f336-4321-9e0e-a5b786035c4d'}, dstDisk={'device': 'disk', 'poolID': '8435060f-b44c-4442-946a-61a8bd8bebfb', 'volumeID': 'abf2d3c5-19ed-461c-8a75-35075e249fec', 'domainID': '14725dff-0afe-48ca-a5da-eef2e6100581', 'imageID': '4ed56955-f336-4321-9e0e-a5b786035c4d'}) from=::ffff:10.35.162.7,41488, flow_id=disks_syncAction_22ee66b1-abe9-4dd3, vmId=db21f7c8-f146-4b6e-8fe3-e10584ae08ed (api:46)
2018-05-21 14:50:34,639+0300 ERROR (jsonrpc/4) [api] FINISH diskReplicateFinish error=Replication not in progress.: {'driveName': 'vda', 'vmId': 'db21f7c8-f146-4b6e-8fe3-e10584ae08ed', 'srcDisk': {'device': 'disk', 'poolID': '8435060f-b44c-4442-946a-61a8bd8bebfb', 'volumeID': 'abf2d3c5-19ed-461c-8a75-35075e249fec', 'domainID': '14725dff-0afe-48ca-a5da-eef2e6100581', 'imageID': '4ed56955-f336-4321-9e0e-a5b786035c4d'}} (api:129)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 568, in diskReplicateFinish
    return self.vm.diskReplicateFinish(srcDisk, dstDisk)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4665, in diskReplicateFinish
    srcDisk=srcDisk)
ReplicationNotInProgress: Replication not in progress.: {'driveName': 'vda', 'vmId': 'db21f7c8-f146-4b6e-8fe3-e10584ae08ed', 'srcDisk': {'device': 'disk', 'poolID': '8435060f-b44c-4442-946a-61a8bd8bebfb', 'volumeID': 'abf2d3c5-19ed-461c-8a75-35075e249fec', 'domainID': '14725dff-0afe-48ca-a5da-eef2e6100581', 'imageID': '4ed56955-f336-4321-9e0e-a5b786035c4d'}}
2018-05-21 14:50:34,642+0300 INFO  (jsonrpc/4) [api.virt] FINISH diskReplicateFinish return={'status': {'message': "Replication not in progress.: {'driveName': 'vda', 'vmId': 'db21f7c8-f146-4b6e-8fe3-e10584ae08ed', 'srcDisk': {'device': 'disk', 'poolID': '8435060f-b44c-4442-946a-61a8bd8bebfb', 'volumeID': 'abf2d3c5-19ed-461c-8a75-35075e249fec', 'domainID': '14725dff-0afe-48ca-a5da-eef2e6100581', 'imageID': '4ed56955-f336-4321-9e0e-a5b786035c4d'}}", 'code': 88}} from=::ffff:10.35.162.7,41488, flow_id=disks_syncAction_22ee66b1-abe9-4dd3, vmId=db21f7c8-f146-4b6e-8fe3-e10584ae08ed (api:52)
2018-05-21 14:50:34,643+0300 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call VM.diskReplicateFinish failed (error 88) in 0.00 seconds (__init__:573)
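
For context on the "Replication not in progress" error (code 88) above, here is a simplified, assumed sketch of the guard that produces it; this is an approximation, not the exact vdsm/virt/vm.py code. diskReplicateFinish refuses to touch the block job when the drive no longer carries replication state, which is the case here because the first attempt had already been torn down:

```python
# Simplified, assumed sketch of the check behind error 88 above;
# not the exact VDSM implementation.
class ReplicationNotInProgress(Exception):
    pass

def disk_replicate_finish(drive, src_disk, dst_disk):
    # While a replication (blockCopy) job runs, VDSM keeps the destination
    # description on the drive object. If that state is gone - e.g. the
    # previous attempt was already aborted - the call is rejected.
    if getattr(drive, 'diskReplicate', None) is None:
        raise ReplicationNotInProgress(
            {'driveName': drive.name, 'srcDisk': src_disk})
    # ... otherwise pivot or abort the libvirt block job ...
```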




Version-Release number of selected component (if applicable):
Engine 4.2.3.5-0.1
VDSM 4.20.27.1-1.

How reproducible:
50%

Steps to Reproduce:
Scenario:
0) Create a VM from a template on iSCSI domain 'iscsi_1', with disk name 'vm_TestCase6000_2114461073_Disk_0', and start the VM.
1) The VM started on host_mixed_3; the SPM was host_mixed_1.
2) Start migration of disk 'vm_TestCase6000_2114461073_Disk_0' from SD 'iscsi_1' to 'iscsi_0' and block the network connection from the SPM (host_mixed_1) to the storage domain (one possible way to do this is sketched after these steps).
3) The disk migration fails as expected.
4) Unblock the network connection from host_mixed_1 to the SD (a new SPM, host_mixed_2, is elected).
5) Start migration of disk 'vm_TestCase6000_2114461073_Disk_0' again from SD 'iscsi_1' to 'iscsi_0'.
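
One possible way to block and unblock the SPM-to-storage connection in steps 2 and 4 (an assumption about the environment, not taken from the automation code; the storage address below is a placeholder):

```python
import subprocess

STORAGE_ADDR = "10.35.0.100"  # placeholder for the iscsi_0 storage server IP

def set_storage_block(enabled):
    """Run on the SPM host: add (enabled=True) or remove (enabled=False)
    an iptables rule that drops outgoing traffic to the storage server.
    Illustrative only; requires root."""
    action = "-A" if enabled else "-D"
    subprocess.check_call(
        ["iptables", action, "OUTPUT", "-d", STORAGE_ADDR, "-j", "DROP"])
```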


Actual results:
Live disk migration for the 2nd time failed with a drive replication error - the srcDisk and dstDisk dictionaries are the same.

Expected results:
Disk migration for the 2nd time from SD 'iscsi_1' to 'iscsi_0' should succeed.

Additional info:
Timeline from automation log:
2018-05-21 14:46:37,292 - MainThread - art.ll_lib.disks - INFO - Get disk vm_TestCase6000_2114461073_Disk_0 storage domain
2018-05-21 14:46:37,898 - MainThread - art.ll_lib.disks - INFO - Disk vm_TestCase6000_2114461073_Disk_0 storage domain is: iscsi_1
2018-05-21 14:46:37,899 - MainThread - art.ll_lib.disks - INFO - Find the storage domain type that the disk vm_TestCase6000_2114461073_Disk_0 is found on
2018-05-21 14:46:38,473 - MainThread - art.ll_lib.disks - INFO - Disk 'vm_TestCase6000_2114461073_Disk_0' is using storage domain of type 'iscsi'
2018-05-21 14:46:39,094 - MainThread - art.ll_lib.disks - INFO - Searching for storage domain with the same type
2018-05-21 14:46:39,291 - MainThread - art.ll_lib.disks - INFO - Disk vm_TestCase6000_2114461073_Disk_0 improper storage domain is: iscsi_0
2018-05-21 14:46:39,916 - MainThread - root - INFO - Start VM vm_TestCase6000_2114461073 with {'wait_for_ip': False, 'pause': False, 'use_cloud_init': False, 'timeout': 600, 'wait_for_status': 'up'}
2018-05-21 14:46:40,049 - MainThread - art.ll_lib.vms - INFO - start VM vm_TestCase6000_2114461073
2018-05-21 14:46:40,050 - MainThread - vms - INFO - Using Correlation-Id: vms_syncAction_8f071067-9e19-4fa5
2018-05-21 14:46:40,680 - MainThread - vms - INFO - Waiting for query `name=vm_TestCase6000_2114461073 and status=up or name=vm_TestCase6000_2114461073 and status=up` and event_id None up to 600 seconds,sampling every 10 second.
2018-05-21 14:47:51,796 - MainThread - art.ll_lib.vms - INFO - Migrating disk vm_TestCase6000_2114461073_Disk_0 of vm vm_TestCase6000_2114461073 to domain iscsi_0
2018-05-21 14:47:51,796 - MainThread - art.ll_lib.disks - INFO - Get disk vm_TestCase6000_2114461073_Disk_0 storage domain
2018-05-21 14:47:52,460 - MainThread - art.ll_lib.disks - INFO - Disk vm_TestCase6000_2114461073_Disk_0 storage domain is: iscsi_1
2018-05-21 14:47:52,461 - MainThread - art.ll_lib.vms - INFO - Moving disk vm_TestCase6000_2114461073_Disk_0 attached to vm vm_TestCase6000_2114461073 from storage domain iscsi_1 to storage domain iscsi_0
2018-05-21 14:47:53,030 - MainThread - disks - INFO - Using Correlation-Id: disks_syncAction_22ee66b1-abe9-4dd3
2018-05-21 14:47:59,262 - MainThread - art.ll_lib.disks - INFO - Waiting for status ok on disks ['4ed56955-f336-4321-9e0e-a5b786035c4d']
2018-05-21 14:50:38,260 - MainThread - art.ll_lib.vms - INFO - Waiting until snapshots: 'all snapshots' of vm_TestCase6000_2114461073 vm are in one of following states: ok
2018-05-21 14:50:38,488 - MainThread - art.ll_lib.jobs - INFO - Waiting for jobs ['Migrating Disk .* from .* to .*']
2018-05-21 14:50:38,531 - MainThread - art.ll_lib.jobs - INFO - Active jobs: []
2018-05-21 14:50:38,577 - MainThread - art.ll_lib.jobs - INFO - All jobs are gone
2018-05-21 14:50:38,577 - MainThread - art.ll_lib.jobs - INFO - Waiting for jobs ['Removing Snapshot .* of VM .*']
2018-05-21 14:50:38,622 - MainThread - art.ll_lib.jobs - DEBUG - Skipping duplicate log-messages...
2018-05-21 14:50:38,668 - MainThread - art.ll_lib.jobs - DEBUG - Skipping duplicate log-messages...
2018-05-21 14:50:38,668 - MainThread - art.ll_lib.disks - INFO - Get disk vm_TestCase6000_2114461073_Disk_0 storage domain
2018-05-21 14:50:39,221 - MainThread - art.ll_lib.disks - INFO - Disk vm_TestCase6000_2114461073_Disk_0 storage domain is: iscsi_1
2018-05-21 14:50:39,222 - MainThread - art.ll_lib.vms - INFO - Verifying whether disk vm_TestCase6000_2114461073_Disk_0 moved from storage domain iscsi_1 to iscsi_1
2018-05-21 14:50:40,630 - MainThread - root - INFO - Waits until SPM gets elected in DataCenter golden_env_mixed with {'sleep': 10, 'timeout': 120}
2018-05-21 14:50:40,633 - MainThread - root - INFO - Locate and return SPM or HSM host from specific datacenter golden_env_mixed with {'spm': True}
2018-05-21 14:50:40,808 - MainThread - root - INFO - Checking Storage Pool Manager (SPM) status of the host host_mixed_1.
2018-05-21 14:50:40,810 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2018-05-21 14:50:40,876 - MainThread - art.ll_lib.hosts - INFO - check_host_spm_status - SPM Status of host host_mixed_1 is: none
2018-05-21 14:50:40,876 - MainThread - root - WARNING - Failed to checking storage pool manager (spm) status of the host host_mixed_1.
2018-05-21 14:50:40,879 - MainThread - root - INFO - Checking Storage Pool Manager (SPM) status of the host host_mixed_2.
2018-05-21 14:50:40,882 - MainThread - root - INFO - Get host object by host_name host_mixed_2. with {'attribute': 'name'}
2018-05-21 14:50:40,941 - MainThread - art.ll_lib.hosts - INFO - check_host_spm_status - SPM Status of host host_mixed_2 is: spm
2018-05-21 14:50:40,943 - MainThread - root - INFO - Get SPM host from the list of hosts host_mixed_1, host_mixed_2, host_mixed_3
2018-05-21 14:50:40,944 - MainThread - root - INFO - Checking Storage Pool Manager (SPM) status of the host host_mixed_1.
2018-05-21 14:50:40,946 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2018-05-21 14:50:41,009 - MainThread - art.ll_lib.hosts - INFO - check_host_spm_status - SPM Status of host host_mixed_1 is: none
2018-05-21 14:50:41,009 - MainThread - root - WARNING - Failed to checking storage pool manager (spm) status of the host host_mixed_1.
2018-05-21 14:50:41,012 - MainThread - root - INFO - Checking Storage Pool Manager (SPM) status of the host host_mixed_2.
2018-05-21 14:50:41,014 - MainThread - root - INFO - Get host object by host_name host_mixed_2. with {'attribute': 'name'}
2018-05-21 14:50:41,080 - MainThread - art.ll_lib.hosts - INFO - check_host_spm_status - SPM Status of host host_mixed_2 is: spm
2018-05-21 14:50:41,084 - MainThread - root - DEBUG - Skipping duplicate log-messages...
2018-05-21 14:50:41,228 - MainThread - art.ll_lib.hosts - INFO - Check if host host_mixed_2 has state up
2018-05-21 14:50:41,228 - MainThread - art.ll_lib.hosts - INFO - Host host_mixed_2 has state up
2018-05-21 14:50:41,228 - MainThread - art.ll_lib.vms - INFO - Migrating disk vm_TestCase6000_2114461073_Disk_0 of vm vm_TestCase6000_2114461073 to domain iscsi_0
2018-05-21 14:50:41,229 - MainThread - art.ll_lib.disks - INFO - Get disk vm_TestCase6000_2114461073_Disk_0 storage domain
2018-05-21 14:50:41,718 - MainThread - art.ll_lib.disks - INFO - Disk vm_TestCase6000_2114461073_Disk_0 storage domain is: iscsi_1
2018-05-21 14:50:41,719 - MainThread - art.ll_lib.vms - INFO - Moving disk vm_TestCase6000_2114461073_Disk_0 attached to vm vm_TestCase6000_2114461073 from storage domain iscsi_1 to storage domain iscsi_0
2018-05-21 14:50:42,193 - MainThread - disks - INFO - Using Correlation-Id: disks_syncAction_378e7e4a-51a9-4f19

Comment 1 Benny Zlotnik 2018-05-21 14:51:03 UTC
I cannot find the reason snapshot creation failed (I assume it's because of the SPM re-election, but just to be certain) - can you provide the logs from the other host?

2018-05-21 14:50:25,308+03 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [] BaseAsyncTask::logEndTaskFailure: Task '8f9e51b2-76a2-4c9f-ab3a-6028f7628f4b' (Parent Command 'CreateSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').',
-- Exception: 'VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').'

The other issue here is that LSM continues after snapshot creation failed; this is being handled in https://bugzilla.redhat.com/show_bug.cgi?id=1497355

Regarding the srcDisk=dstDisk - there are two types of calls we make with diskReplicateFinish:
1. When we have finished the syncData phase of LSM and we want to pivot - we call it with srcDisk and dstDisk set according to their storage domains.
2. When we have a failure during LSM and we want to abort the blockjob - we call diskReplicateFinish with the same src and dst, and VDSM then calls blockJobAbort with the abort flag instead of the pivot flag (there is room for improvement here).

Case 2 currently doesn't work well and is handled in https://gerrit.ovirt.org/#/c/90532/
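
As a rough illustration of the two call modes above (a sketch under the assumptions in this comment, not the actual VDSM code): equal src/dst disks mean "abort the block job", different disks mean "pivot to the destination", mapped onto libvirt's blockJobAbort flags.

```python
import libvirt  # assumed available on the host

def disk_replicate_finish(dom, drive_name, src_disk, dst_disk):
    """Sketch of the contract described above; not the real VDSM code.

    dom is a libvirt.virDomain, drive_name is e.g. 'vda'.
    """
    if src_disk == dst_disk:
        # abort case (2): stop the replication, keep using the source volume
        flags = 0
    else:
        # pivot case (1): switch the VM to the destination volume
        flags = libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
    dom.blockJobAbort(drive_name, flags=flags)
```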

Comment 2 Avihai 2018-05-21 15:12:21 UTC
(In reply to Benny Zlotnik from comment #1)
> I cannot find the reason snapshot creation failed (I assume it's because of
> the SPM reelection, but just to be certain), can you provide the logs from
> the other host?
I attached the vdsm logs of both other hosts (host_mixed_1 & 2) for the relevant times.

This failure time, 2018-05-21 14:50:25, matches the FIRST move of the disk, which is expected to fail due to the blocked network connection between the SPM at that time (host_mixed_1) and SD 'iscsi_0'.


Comment 5 Tal Nisan 2018-06-12 13:32:11 UTC
Bug 1590292 currently blocks verification; moving to ASSIGNED until bug 1590292 is resolved.

Comment 6 Sandro Bonazzola 2018-06-29 15:33:07 UTC
Bug #1590292 has been fixed in ovirt-engine-4.2.4.5.
Moving this back to QE; it can be verified with the 4.2.4 GA release.

Comment 7 Yosi Ben Shimon 2018-07-03 09:06:49 UTC
Tested using:
ovirt-engine-4.2.4.5-0.1.el7_3.noarch

Ran the automation TC (TestCase6000).

Actual result:
The 1st migration failed as expected.
The 2nd migration succeeded after the connection between the SPM and the storage was unblocked.

Moving to VERIFIED

Comment 8 Sandro Bonazzola 2018-07-31 15:25:50 UTC
This bugzilla is included in the oVirt 4.2.5 release, published on July 30th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.