Bug 1566471 - Retry LSM after previous failure fails with exception in vdsm
Summary: Retry LSM after previous failure fails with exception in vdsm
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.4.0
Assignee: Benny Zlotnik
QA Contact: Shir Fishbain
URL:
Whiteboard:
Depends On: 1697261
Blocks:
 
Reported: 2018-04-12 11:55 UTC by Kevin Alon Goldblatt
Modified: 2020-05-11 10:35 UTC (History)
5 users

Fixed In Version: rhv-4.4.0-28
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-11 10:35:19 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.4+
ylavi: exception+


Attachments (Terms of Use)
engine and vdsm logs (5.46 MB, application/x-gzip)
2018-04-12 11:55 UTC, Kevin Alon Goldblatt
Logs (3.30 MB, application/zip)
2019-04-03 15:34 UTC, Shir Fishbain
logs_1566471 (2.39 MB, application/zip)
2019-04-07 17:01 UTC, Shir Fishbain

Description Kevin Alon Goldblatt 2018-04-12 11:55:20 UTC
Created attachment 1420819 [details]
engine and vdsm logs

Description of problem:
LSM from a gluster domain to an iSCSI domain is made to fail by blocking the source (gluster) domain with iptables on the host. After restoring access to the domain, the LSM is tried again. This time it fails with an exception in vdsm: "Resource factory failed to create resource".

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.2.6-0.1.el7.noarch
vdsm-4.20.23-1.el7ev.x86_64

How reproducible:
Checked this with development; it seems different from https://bugzilla.redhat.com/show_bug.cgi?id=1565685, so a separate defect is needed.

Steps to Reproduce:
Verified with the following scenario:
--------------------------------------
1. Create a VM with disks and an OS, and write data to the disks
2. LSM the gluster disk to the iSCSI domain -> during the LSM, block access from the host to the source gluster storage with iptables >>>> the LSM fails and the host is moved to Non-Operational (a sketch of the block/restore step follows these steps)
3. Restore the connection to the gluster storage - the host moves Up again
4. Try running the LSM again >>>> after the auto-generated snapshot is created, the vdsm host throws an exception and the LSM fails. NB: the temporary snapshots are NOT removed (new bug needed for this?)
5. NB: the same LSM to an NFS domain works fine
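
For reference, a minimal sketch of the block/restore step from step 2, assuming it is run as root on the host and using one gluster node IP (10.35.83.240, taken from comment 6) as a placeholder; it simply wraps the same iptables commands used elsewhere in this report:

import subprocess

GLUSTER_NODE = "10.35.83.240"  # placeholder: gluster node IP, taken from comment 6

def block_gluster(node=GLUSTER_NODE):
    # Drop outgoing traffic from this host to the gluster node,
    # simulating the storage outage during the LSM (step 2).
    subprocess.run(["iptables", "-A", "OUTPUT", "-d", node, "-j", "DROP"], check=True)

def restore_gluster(node=GLUSTER_NODE):
    # Remove the DROP rule again to restore connectivity (step 3).
    subprocess.run(["iptables", "-D", "OUTPUT", "-d", node, "-j", "DROP"], check=True)

if __name__ == "__main__":
    block_gluster()
    input("Gluster access blocked; press Enter to restore...")
    restore_gluster()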

Actual results:
LSM retry failed

Expected results:
LSM retry after previous failure should work

Additional info:
Vdsm.log:
--------------------------------------------
2018-04-11 13:01:56,863+0300 DEBUG (tasks/3) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n512 bytes (512 B) copied, 0.0419374 s, 12.2 kB/s\n'; <rc> = 0 (commands:86)
2018-04-11 13:01:56,865+0300 DEBUG (tasks/3) [storage.Misc] err: ['1+0 records in', '1+0 records out', '512 bytes (512 B) copied, 0.0419374 s, 12.2 kB/s'], size: 512 (misc:124)
2018-04-11 13:01:56,867+0300 WARN  (tasks/3) [storage.ResourceManager] Resource factory failed to create resource '01_img_57635752-238d-4948-81f0-e4f010845d16.0bba11aa-6502-4e49-b1f5-df7a5eb2cc33'. Canceling request. (resourceManager:543)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 539, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 213, in getChain
    if srcVol.isLeaf():
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1430, in isLeaf
    return self._manifest.isLeaf()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 138, in isLeaf
    return self.getVolType() == sc.type2name(sc.LEAF_VOL)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 134, in getVolType
    self.voltype = self.getMetaParam(sc.VOLTYPE)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 118, in getMetaParam
    meta = self.getMetadata()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 112, in getMetadata
    md = VolumeMetadata.from_lines(lines)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volumemetadata.py", line 103, in from_lines
    "Missing metadata key: %s: found: %s" % (e, md))
MetaDataKeyNotFoundError: Meta Data key not found error: ("Missing metadata key: 'DOMAIN': found: {'NONE': '#######################################################################################################
###################################################################################################################################################################################################################
############################################################################################################################################################################################'}",)
2018-04-11 13:01:56,870+0300 DEBUG (tasks/3) [storage.ResourceManager.Request] (ResName='01_img_57635752-238d-4948-81f0-e4f010845d16.0bba11aa-6502-4e49-b1f5-df7a5eb2cc33', ReqID='fcfb576c-7f91-474f-90a9-b32aa6b3b7d4') Canceled request (resourceManager:194)
2018-04-11 13:01:56,872+0300 WARN  (tasks/3) [storage.ResourceManager.Request] (ResName='01_img_57635752-238d-4948-81f0-e4f010845d16.0bba11aa-6502-4e49-b1f5-df7a5eb2cc33', ReqID='fcfb576c-7f91-474f-90a9-b32aa6b3b7d4') Tried to cancel a processed request (resourceManager:187)
2018-04-11 13:01:56,918+0300 ERROR (tasks/3) [storage.TaskManager.Task] (Task='24016b80-e4fa-4179-bc45-f90e4af5b95b') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1731, in cloneImageStructure
    with nested(*resList):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1726, in <genexpr>
    resList = (rm.acquireResource(*x) for x in sorted((
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1025, in acquireResource
    return _manager.acquireResource(namespace, name, lockType, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 475, in acquireResource
    raise se.ResourceAcqusitionFailed()
ResourceAcqusitionFailed: Could not acquire resource. Probably resource factory threw an exception.: ()
2018-04-11 13:01:56,920+0300 DEBUG (tasks/3) [storage.TaskManager.Task] (Task='24016b80-e4fa-4179-bc45-f90e4af5b95b') Task._run: 24016b80-e4fa-4179-bc45-f90e4af5b95b () {} failed - stopping task (task:894)
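
The "found: {'NONE': '###...'}" part of the error above means the block volume's metadata slot holds only a padding entry instead of real metadata, so the required 'DOMAIN' key is missing and the resource factory cannot build the image chain; this is consistent with a leftover placeholder volume whose metadata was cleared (or never written) by the failed first attempt. The following is an illustrative sketch only, not vdsm's actual code, assuming simple KEY=VALUE metadata lines:

# Illustrative sketch only -- not vdsm's implementation.
# Shows why a wiped metadata slot produces
# "Missing metadata key: 'DOMAIN': found: {'NONE': '###...'}".

class MetaDataKeyNotFoundError(Exception):
    pass

def parse_volume_metadata(lines, required_key="DOMAIN"):
    # Parse KEY=VALUE metadata lines into a dict and require one key.
    md = {}
    for line in lines:
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        md[key.strip()] = value.strip()
    if required_key not in md:
        raise MetaDataKeyNotFoundError(
            "Missing metadata key: %r: found: %s" % (required_key, md))
    return md

# A cleared slot holds only padding, as in the vdsm log above:
wiped_slot = ["NONE=" + "#" * 40]
try:
    parse_volume_metadata(wiped_slot)
except MetaDataKeyNotFoundError as exc:
    print(exc)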



engine.log
--------------------------------------
--- *** First LSM with block of iptables --- ***
2018-04-10 18:16:35,233+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-22) [5741b23a-5e62-4f66-a2ad-110f4dbca85f] EVENT_ID: USER_MOVED_DISK(2,008), User admin@internal-authz moving disk vm10042018_Disk3 to domain iscsi_2.
2018-04-10 18:16:35,255+03 INFO  [org.ovirt.engine.core.bll.storage.disk.MoveDiskCommand] (default task-22) [5741b23a-5e62-4f66-a2ad-110f4dbca85f] Lock freed to object 'EngineLock:{exclusiveLocks='[0bba11aa-6502-4e49-b1f5-df7a5eb2cc33=DISK]', sharedLocks=''}'
.
.
.
--- *** Auto snapshot created during LSM --- ***
2018-04-10 18:16:53,669+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-17) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_SUCCESS(68), Snapshot 'vm10042018_Disk3 Auto-generated for Live Storage Migration' creation for VM 'vm10042018' has been completed.

.
.
.
--- *** Create image placeholder fails --- ***
2018-04-10 18:19:48,350+03 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-69) [] BaseAsyncTask::logEndTaskFailure: Task 'db0eeea2-71dc-44c7-9e25-183434dc58ff' (Parent Command 'CreateImagePlaceholder', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanFailure'
-- Message: 'VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanFailure').',
-- Exception: 'VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanFailure').'
2018-04-10 18:19:48,352+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-69) [] CommandAsyncTask::endActionIfNecessary: All tasks of command '5f62defd-36c1-443b-b903-fca494ed3243' has ended -> executing 'endAction'
2018-04-10 18:19:48,352+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-69) [] CommandAsyncTask::endAction: Ending action for '1' tasks (command ID: '5f62defd-36c1-443b-b903-fca494ed3243'): calling endAction '.
2018-04-10 18:19:48,353+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-257036) [] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'CreateImagePlaceholder',
2018-04-10 18:19:48,372+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.CreateImagePlaceholderCommand] (EE-ManagedThreadFactory-engine-Thread-257036) [5741b23a-5e62-4f66-a2ad-110f4dbca85f] Ending command 'org.ovirt.engine.core.bll.storage.lsm.CreateImagePlaceholderCommand' with failure.

.
.
.
--- *** Failure of LSM after blocking Source domain --- ***
2018-04-10 18:19:57,737+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-30) [5741b23a-5e62-4f66-a2ad-110f4dbca85f] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2018-04-10 18:19:57,747+03 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-30) [5741b23a-5e62-4f66-a2ad-110f4dbca85f] Trying to release exclusive lock which does not exist, lock key: '87a3516c-a6ce-4afc-abb8-8bc2eda67e02LIVE_STORAGE_MIGRATION'
2018-04-10 18:19:57,747+03 INFO  [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-30) [5741b23a-5e62-4f66-a2ad-110f4dbca85f] Lock freed to object 'EngineLock:{exclusiveLocks='[87a3516c-a6ce-4afc-abb8-8bc2eda67e02=LIVE_STORAGE_MIGRATION]', sharedLocks=''}'
2018-04-10 18:19:57,793+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-30) [5741b23a-5e62-4f66-a2ad-110f4dbca85f] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), User admin@internal-authz have failed to move disk vm10042018_Disk3 to domain iscsi_2.
.
.
.

--- *** SECOND LSM Retry --- ***
--- *** Start 2nd attempt of LSM after restoring the connection to source domain *** ---
2018-04-11 13:01:37,769+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-15) [d3455377-2064-4ff7-8cf1-20782734e9df] EVENT_ID: USER_MOVED_DISK(2,008), User admin@internal-authz moving disk vm10042018_Disk3 to domain iscsi_2.
.
.
.
--- *** Create automatic snapshot *** ---
2018-04-11 13:01:54,502+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_SUCCESS(68), Snapshot 'vm10042018_Disk3 Auto-generated for Live Storage Migration' creation for VM 'vm10042018' has been completed.
2018-04-11 13:01:54,626+03 INFO  [org.ovirt.engine.core.bll.storage.lsm.CreateImagePlaceholderCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [d3455377-2064-4ff7-8cf1-20782734e9df] Running command: CreateImagePlaceholderCommand internal: true.
.
.
.
2018-04-11 13:01:55,452+03 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [d3455377-2064-4ff7-8cf1-20782734e9df] Adding task '24016b80-e4fa-4179-bc45-f90e4af5b95b' (Parent Command 'CreateImagePlaceholder', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters'), polling hasn't started yet..
2018-04-11 13:01:55,470+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [d3455377-2064-4ff7-8cf1-20782734e9df] BaseAsyncTask::startPollingTask: Starting to poll task '24016b80-e4fa-4179-bc45-f90e4af5b95b'.
2018-04-11 13:01:57,557+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-73) [d3455377-2064-4ff7-8cf1-20782734e9df] Command 'LiveMigrateDisk' (id: 'bd57eb56-7910-476d-af0a-2f2a41f484d2') waiting on child command id: '8a7480af-3819-4189-9c5c-f1ca820142ed' type:'CreateImagePlaceholder' to complete
2018-04-11 13:02:00,973+03 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [] Polling and updating Async Tasks: 2 tasks, 1 tasks to poll now
2018-04-11 13:02:01,007+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [] Failed in 'HSMGetAllTasksStatusesVDS' method
2018-04-11 13:02:01,015+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command HSMGetAllTasksStatusesVDS failed: Could not acquire resource. Probably resource factory threw an exception.: ()
2018-04-11 13:02:01,016+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [] SPMAsyncTask::PollTask: Polling task '24016b80-e4fa-4179-bc45-f90e4af5b95b' (Parent Command 'CreateImagePlaceholder', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'cleanSuccess'.
2018-04-11 13:02:01,021+03 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [] BaseAsyncTask::logEndTaskFailure: Task '24016b80-e4fa-4179-bc45-f90e4af5b95b' (Parent Command 'CreateImagePlaceholder', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Could not acquire resource. Probably resource factory threw an exception.: (), code = 100',
-- Exception: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Could not acquire resource. Probably resource factory threw an exception.: (), code = 100'
2018-04-11 13:02:01,022+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [] CommandAsyncTask::endActionIfNecessary: All tasks of command '8a7480af-3819-4189-9c5c-f1ca820142ed' has ended -> executing 'endAction'
2018-04-11 13:02:01,022+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [] CommandAsyncTask::endAction: Ending action for '1' tasks (command ID: '8a7480af-3819-4189-9c5c-f1ca820142ed'): calling endAction '.
2018-04-11 13:02:01,023+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-283906) [] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'CreateImagePlaceholder',
2018-04-11 13:02:01,038+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.CreateImagePlaceholderCommand] (EE-ManagedThreadFactory-engine-Thread-283906) [d3455377-2064-4ff7-8cf1-20782734e9df] Ending command 'org.ovirt.engine.core.bll.storage.lsm.CreateImagePlaceholderCommand' with failure.
2018-04-11 13:02:01,055+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-283906) [d3455377-2064-4ff7-8cf1-20782734e9df] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'CreateImagePlaceholder' completed, handling the result.
2018-04-11 13:02:01,055+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-283906) [d3455377-2064-4ff7-8cf1-20782734e9df] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'CreateImagePlaceholder' succeeded, clearing tasks.
2018-04-11 13:02:01,055+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-283906) [d3455377-2064-4ff7-8cf1-20782734e9df] SPMAsyncTask::ClearAsyncTask: Attempting to clear task '24016b80-e4fa-4179-bc45-f90e4af5b95b'
2018-04-11 13:02:01,060+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-283906) [d3455377-2064-4ff7-8cf1-20782734e9df] START, SPMClearTaskVDSCommand( SPMTaskGuidBaseVDSCommandParameters:{storagePoolId='a9e2b4b2-6c07-4af8-8fb0-5675878e2376', ignoreFailoverLimit='false', taskId='24016b80-e4fa-4179-bc45-f90e4af5b95b'}), log id: 1ac62db6
2018-04-11 13:02:01,065+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-283906) [d3455377-2064-4ff7-8cf1-20782734e9df] START, HSMClearTaskVDSCommand(HostName = host_mixed_2, HSMTaskGuidBaseVDSCommandParameters:{hostId='fea2403a-5581-4920-80db-a015412f72ef', taskId='24016b80-e4fa-4179-bc45-f90e4af5b95b'}), log id: 3b2ed7e0
2018-04-11 13:02:01,126+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-283906) [d3455377-2064-4ff7-8cf1-20782734e9df] FINISH, HSMClearTaskVDSCommand, log id: 3b2ed7e0
2018-04-11 13:02:01,126+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-283906) [d3455377-2064-4ff7-8cf1-20782734e9df] FINISH, SPMClearTaskVDSCommand, log id: 1ac62db6
2018-04-11 13:02:01,132+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-283906) [d3455377-2064-4ff7-8cf1-20782734e9df] BaseAsyncTask::removeTaskFromDB: Removed task '24016b80-e4fa-4179-bc45-f90e4af5b95b' from DataBase
2018-04-11 13:02:01,132+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-283906) [d3455377-2064-4ff7-8cf1-20782734e9df] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity '8a7480af-3819-4189-9c5c-f1ca820142ed'

--- *** Failing the LSM *** ---
2018-04-11 13:02:01,646+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-15) [d3455377-2064-4ff7-8cf1-20782734e9df] Command 'LiveMigrateDisk' id: 'bd57eb56-7910-476d-af0a-2f2a41f484d2' child commands '[e57682c3-c77e-49cb-8421-0fd9c1f046d0, 8a7480af-3819-4189-9c5c-f1ca820142ed]' executions were completed, status 'FAILED'
2018-04-11 13:02:02,733+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [d3455377-2064-4ff7-8cf1-20782734e9df] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2018-04-11 13:02:02,742+03 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [d3455377-2064-4ff7-8cf1-20782734e9df] Trying to release exclusive lock which does not exist, lock key: '87a3516c-a6ce-4afc-abb8-8bc2eda67e02LIVE_STORAGE_MIGRATION'
2018-04-11 13:02:02,742+03 INFO  [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [d3455377-2064-4ff7-8cf1-20782734e9df] Lock freed to object 'EngineLock:{exclusiveLocks='[87a3516c-a6ce-4afc-abb8-8bc2eda67e02=LIVE_STORAGE_MIGRATION]', sharedLocks=''}'
2018-04-11 13:02:02,773+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [d3455377-2064-4ff7-8cf1-20782734e9df] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), User admin@internal-authz have failed to move disk vm10042018_Disk3 to domain iscsi_2.
2018-04-11 13:02:04,874+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DumpXmlsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [] START, DumpXmlsVDSCommand(HostName = host_mixed_2, Params:{hostId='fea2403a-5581-4920-80db-a015412f72ef', vmIds='[87a3516c-a6ce-4afc-abb8-8bc2eda67e02]'}), log id: 4d9d8373
2018-04-11 13:02:05,043+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DumpXmlsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [] FINISH, DumpXmlsVDSCommand, return: [{vmId=87a3516c-a6ce-4afc-abb8-8bc2eda67e02, devices=[Ljava.util.Map;@3c86a1b9, guestDiskMapping={8aba4520-e2f6-44c9-9eda-5d8e950cf539={name=/dev/sda}, 3f032b55-2d3a-4a64-b061-66c525093be2={name=/dev/sdc}, 0bba11aa-6502-4e49-b1f5-df7a5eb2cc33={name=/dev/sdb}, 64d458f2-2589-453f-a84a-8e070d06b543={name=/dev/vda}}}], log id: 4d9d8373
2018-04-11 13:02:24,131+03 INFO  [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedThreadFactory-engineThreadMonitoring-Thread-1) [] Thread pool 'commandCoordinator' is using 0 threads out of 10 and 10 tasks are waiting in the queue.

Comment 1 Benny Zlotnik 2018-06-05 13:12:03 UTC
Just a note while I am still looking into this:
If the storage was blocked during LSM, the second attempt will most likely fail, as there are leftovers on the storage.
For obvious reasons, we cannot perform cleanup if the storage is not available.
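
Related to the NB in the description (the auto-generated LSM snapshot is not removed after the failure), one way to spot such leftovers before retrying is to list the VM's snapshots through the oVirt Python SDK. This is a rough sketch under assumptions: the engine URL, credentials and VM name are placeholders, and the snapshot description string is taken from the engine.log excerpts above.

import ovirtsdk4 as sdk

# Placeholders: adjust the engine URL, credentials and VM name to your setup.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="password",
    insecure=True,
)
try:
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search="name=vm10042018")[0]
    snapshots_service = vms_service.vm_service(vm.id).snapshots_service()
    for snap in snapshots_service.list():
        # The engine names these snapshots "... Auto-generated for Live
        # Storage Migration" (see the engine.log lines above).
        if snap.description and "Auto-generated for Live Storage Migration" in snap.description:
            print("Leftover LSM snapshot:", snap.id, snap.description)
finally:
    connection.close()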

Comment 2 Elad 2018-07-18 09:12:16 UTC
The exception in vdsm seems to be the same as in bug 1574631

Comment 3 Benny Zlotnik 2018-08-06 12:10:07 UTC
Might be, not sure it's the same root cause though

Comment 4 Sandro Bonazzola 2019-01-28 09:34:28 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 5 Benny Zlotnik 2019-03-12 11:21:48 UTC
I think this has been fixed; setting the bug to MODIFIED so it can be retested.

Comment 6 Shir Fishbain 2019-04-03 15:31:33 UTC
Partially verified.

The steps to reproduce are:

1. Create a VM with disks and an OS, and write data to the disks
2. Run the VM
3. Run LSM from the gluster domain to the iSCSI domain
4. Run the iptables command: iptables -A OUTPUT -d 10.35.83.240 -j DROP (blocking the gluster domain)
--------------The LSM failed-------------------
5. Wait for the host to move to "Non-Operational" status (I waited half an hour; it never appears)
6. Run the iptables command: iptables -D OUTPUT -d 10.35.83.240 -j DROP (restoring the connection to the gluster storage)
7. The LSM succeeded

My question is: why isn't the host in "Non-Operational" mode after the LSM failure?

Logs attached

Comment 7 Shir Fishbain 2019-04-03 15:34:40 UTC
Created attachment 1551415 [details]
Logs

Comment 8 Shir Fishbain 2019-04-07 16:59:44 UTC
In each attempt to reproduce this bug, the VM moves to one of the following two states:

1. Paused (ran on host_mixed_3)
From engine log:
2019-04-07 18:01:41,724+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] EVENT_ID: VM_PAUSED(1,025), VM shir_vm_1 has been paused.
2019-04-07 18:01:41,742+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] EVENT_ID: VM_PAUSED_ERROR(139), VM shir_vm_1 has been paused due to unknown storage error.

2. Not responding (ran on host_mixed_2)
From engine log:
2019-04-07 18:51:44,894+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-5159) [4d019771] Failed to migrate VM 'shir_vm_2'
2019-04-07 18:52:22,058+03 INFO  [org.ovirt.engine.core.bll.provider.network.SyncNetworkProviderCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [101faad5] Lock Acquired to object 'EngineLock:{exclus
iveLocks='[ebe30960-efcd-4b81-8c6f-262b311893bb=PROVIDER]', sharedLocks=''}'
2019-04-07 18:52:22,079+03 INFO  [org.ovirt.engine.core.bll.provider.network.SyncNetworkProviderCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [101faad5] Running command: SyncNetworkProviderCommand
 internal: true.
2019-04-07 18:52:22,257+03 INFO  [org.ovirt.engine.core.sso.utils.AuthenticationUtils] (default task-50) [] User admin@internal successfully logged in with scopes: ovirt-app-api ovirt-ext=token-info:authz-search
 ovirt-ext=token-info:public-authz-search ovirt-ext=token-info:validate ovirt-ext=token:password-access
2019-04-07 18:52:22,507+03 INFO  [org.ovirt.engine.core.bll.provider.network.SyncNetworkProviderCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [101faad5] Lock freed to object 'EngineLock:{exclusive
Locks='[ebe30960-efcd-4b81-8c6f-262b311893bb=PROVIDER]', sharedLocks=''}'
2019-04-07 18:52:45,212+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] VM 'd5bf5866-2f72-4f0e-9997-60dc13d0004b'(shir_vm_2) moved from 'Paused' --> 'Up'
2019-04-07 18:52:45,346+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM shir_vm_2 has recovered from
 paused back to up.
2019-04-07 18:52:50,225+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-21) [] VM 'd5bf5866-2f72-4f0e-9997-60dc13d0004b'(shir_vm_2) moved from 'Up
' --> 'NotResponding'
2019-04-07 18:52:50,243+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-21) [] EVENT_ID: VM_NOT_RESPONDING(126), VM shir_vm_2 is not responding.
2019-04-07 18:54:35,730+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-38) [] EVENT_ID: VM_NOT_RESPONDING(126), VM shir_vm_2 is not responding.


Steps to reproduce:
1. Create a VM with 2 gluster disks and an OS, and write to one of the disks
2. Run LSM from the gluster domain to the iSCSI domain
3. Block the connection between the host running the VM and the gluster storage (iptables -A OUTPUT -d <node IP> -j DROP for each of the 3 gluster nodes, 10.35.83.240/241/242) - the host is moved to Non-Operational
4. Restore the connection to the gluster storage - the host moves Up again

Actual results:
The VM moves to Paused or "Not Responding" status

Expected results:
The VM moves back to Up

Additional info:
I can't finish verifying the bug, since the VM never returns to "Up" status, so the second LSM cannot be run.

Comment 9 Shir Fishbain 2019-04-07 17:01:06 UTC
Created attachment 1553285 [details]
logs_1566471

Comment 10 Avihai 2019-04-07 17:45:16 UTC
Shir, I think the best thing to do is to open a new bug on the issues you see in comment #8 and mark the new bug as blocking this bug.

Comment 11 Avihai 2019-04-07 17:47:27 UTC
(In reply to Avihai from comment #10)
> Shir, I think the best thing to do is to open a new bug on the issues you
> see in comment #8 and mark the new bug as blocking this bug.

Then we (Tal) will retarget this bug according to the blocker bug, and when the blocker bug is fixed we can also verify this one.

Benny, what do you think?

Comment 12 Shir Fishbain 2019-04-08 08:52:33 UTC
The new bug is now set as blocking this bug.
I can't finish verifying this bug, since the VM never returns to "Up" status, so the second LSM cannot be run.
I have opened a new bug for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1697261

Comment 13 Sandro Bonazzola 2019-06-14 09:58:14 UTC
Is this fixed in ovirt-engine-4.3.5?

Comment 14 Benny Zlotnik 2019-06-18 07:31:49 UTC
(In reply to Sandro Bonazzola from comment #13)
> Is this fixed in ovirt-engine-4.3.5?

It is currently blocked by another bug.

Comment 15 Sandro Bonazzola 2019-06-19 15:47:46 UTC
Re-targeting to 4.3.6 since this bug is blocked by bug #1697261 which has been re-targeted to 4.3.6 yesterday.
If this needs to block 4.3.5 release please move both this and bug #1697261 back to 4.3.5

Comment 16 Avihai 2019-07-03 08:28:00 UTC
(In reply to Sandro Bonazzola from comment #15)
> Re-targeting to 4.3.6 since this bug is blocked by bug #1697261 which has
> been re-targeted to 4.3.6 yesterday.
> If this needs to block 4.3.5 release please move both this and bug #1697261
> back to 4.3.5

Tal, see Sandro's comment #15: this bug is blocked by bug #1697261, which has been re-targeted to 4.3.6.

Please retarget to 4.3.5

Comment 17 Avihai 2019-07-03 08:28:32 UTC
(In reply to Avihai from comment #16)
> (In reply to Sandro Bonazzola from comment #15)
> > Re-targeting to 4.3.6 since this bug is blocked by bug #1697261 which has
> > been re-targeted to 4.3.6 yesterday.
> > If this needs to block 4.3.5 release please move both this and bug #1697261
> > back to 4.3.5
> 
> Tal, see Sandro's comment #15: this bug is blocked by bug #1697261, which
> has been re-targeted to 4.3.6.
> 
> Please retarget to 4.3.5
I meant retarget to 4.3.6.

Comment 18 RHV bug bot 2019-07-04 13:10:56 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: infra

Comment 19 RHV bug bot 2019-08-02 17:21:15 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: infra

Comment 20 RHV bug bot 2019-08-08 13:17:35 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: infra

Comment 21 RHV bug bot 2019-08-15 14:05:02 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: infra

Comment 22 RHV bug bot 2019-08-23 13:25:53 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: infra

Comment 23 Avihai 2019-08-25 06:15:35 UTC
Tal, this bug is blocked by bug 1697261 (see comment 12), which has been re-targeted to 4.4.
Please retarget this bug to 4.4 as well.

Comment 27 Benny Zlotnik 2020-05-11 10:35:19 UTC
I'm closing this bug; there have been a lot of changes in the relevant flows since it was reported. Please open a new bug against a current version if this is still relevant.

