Bug 1468883

Summary: Cold storage migration failure doesn't raise DiskException as expected
Product: [oVirt] ovirt-engine
Reporter: Lilach Zitnitski <lzitnits>
Component: BLL.Storage
Assignee: Daniel Erez <derez>
Status: CLOSED NOTABUG
QA Contact: Raz Tamir <ratamir>
Severity: medium
Priority: unspecified
Version: 4.2.0
CC: bugs, tnisan
Target Milestone: ovirt-4.2.0
Keywords: Automation, Regression
Target Release: ---
Flags: rule-engine: ovirt-4.2+, rule-engine: blocker+
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2017-07-10 09:47:22 UTC
Type: Bug
Regression: ---
oVirt Team: Storage
Attachments: logs

Description Lilach Zitnitski 2017-07-09 11:37:11 UTC
Description of problem:
On 4.1.3, restarting the vdsmd service on the SPM host during a cold storage migration causes the migration to fail with the following error:
2017-07-09 11:46:31,395 - MainThread - api_utils - ERROR - Failed to syncAction element NOT as expected:
        Status: 409
        Reason: Conflict
        Detail: [Cannot move Virtual Disk. The relevant Storage Domain's status is Unknown.]
On 4.2, this error is no longer shown in the same scenario, and the REST log contains no helpful information about why the operation failed.

2017-07-09 11:59:18,886 - MainThread - disks - INFO - Using Correlation-Id: disks_syncAction_3842a0cc-3429-4a82
2017-07-09 11:59:19,039 - MainThread - core_api - DEBUG - Request POST response time: 0.017
2017-07-09 11:59:19,039 - MainThread - disks - DEBUG - Cleaning Correlation-Id: disks_syncAction_3842a0cc-3429-4a82
2017-07-09 11:59:19,040 - MainThread - disks - DEBUG - Response code is valid: [200, 201]
2017-07-09 11:59:19,042 - MainThread - disks - DEBUG - Action status is valid: ['complete']
2017-07-09 11:59:19,093 - MainThread - art.logging - ERROR - Status: failed

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20170707124946.gitf15a6d9.el7.centos.noarch
vdsm-4.20.1-157.git79aca9e.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with a few disks
2. Start migrating the disks to another storage domain
3. Restart the vdsmd service on the SPM host

Actual results:
Migration fails but no exception is raised 

Expected results:
Migration should fail with DiskException 

Additional info:

4.1.3 engine log

2017-07-09 11:46:31,103+03 INFO  [org.ovirt.engine.core.bll.storage.disk.MoveDisksCommand] (default task-10) [disks_syncAction_3a6224da-b3c5-46fd] Running command: MoveDisksCommand internal: false. Entities affected :  ID: a9c880ae-5fc3-4d0f-a124-8f477945c890 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER
2017-07-09 11:46:31,290+03 INFO  [org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand] (default task-10) [disks_syncAction_3a6224da-b3c5-46fd] Lock Acquired to object 'EngineLock:{exclusiveLocks='[a9c880ae-5fc3-4d0f-a124-8f477945c890=DISK]', sharedLocks='[80252462-ea54-4aaa-83ab-5d0997186088=VM]'}'
2017-07-09 11:46:31,354+03 WARN  [org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand] (default task-10) [disks_syncAction_3a6224da-b3c5-46fd] Validation of action 'MoveOrCopyDisk' failed for user admin@internal-authz. Reasons: VAR__ACTION__MOVE,VAR__TYPE__DISK,ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2,$status Unknown
2017-07-09 11:46:31,360+03 INFO  [org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand] (default task-10) [disks_syncAction_3a6224da-b3c5-46fd] Lock freed to object 'EngineLock:{exclusiveLocks='[a9c880ae-5fc3-4d0f-a124-8f477945c890=DISK]', sharedLocks='[80252462-ea54-4aaa-83ab-5d0997186088=VM]'}'
2017-07-09 11:46:31,374+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-10) [] Operation Failed: [Cannot move Virtual Disk. The relevant Storage Domain's status is Unknown.]

4.2 engine.log

2017-07-09 12:27:29,572+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-25) [disks_syncAction_207d937a-6eea-43a7] EVENT_ID: USER_MOVED_DISK(2,008), User admin@internal-authz moving disk disk_virtiocow_0912252731 to domain nfs_1.
2017-07-09 12:27:36,055+03 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Broken pipe
2017-07-09 12:27:36,057+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [ad72377] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_3, VdsIdVDSCommandParametersBase:{hostId='f4085187-a3a2-4b3b-b46f-02d0b9e553fe'})' execution failed: VDSGenericException: VDSNetworkException: Broken pipe

Comment 1 Lilach Zitnitski 2017-07-09 11:37:44 UTC
Created attachment 1295612 [details]
logs

Comment 2 Red Hat Bugzilla Rules Engine 2017-07-09 16:43:19 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 3 Daniel Erez 2017-07-10 09:47:22 UTC
The described scenario doesn't sound like a regression, but merely a change in behavior. Moving a disk is an asynchronous operation, so when using the REST API its status should be polled. The mentioned exception [1] is part of the validation performed when initiating the operation; i.e., when moving a disk, we validate the domain status *before* the operation begins, not during it.

[1] "Cannot move Virtual Disk. The relevant Storage Domain's status is Unknown."