Bug 1808850

Summary: Failed to create a new disk or copy template disk on iscsi domains due to OVF failure in ILLEGAL state
Product: [oVirt] vdsm
Reporter: Yosi Ben Shimon <ybenshim>
Component: General
Assignee: Amit Bawer <abawer>
Status: CLOSED WORKSFORME
QA Contact: Evelina Shames <eshames>
Severity: urgent
Priority: urgent
Docs Contact:
Version: ---
CC: abawer, aefrat, bugs, bzlotnik, eshames, izuckerm, michal.skrivanek, mtessun, mzamazal, nsoffer
Target Milestone: ovirt-4.4.0
Target Release: 4.40.7
Keywords: Automation, Regression, TestBlocker
Flags: pm-rhel: ovirt-4.4+, pm-rhel: blocker?
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version: vdsm-4.40.7
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-04-01 07:31:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
2nd_rep_storage-ge-01_env (flags: none)
rhv-4.4.0-26 reproduction engine/vdsm logs (flags: none)

Description Yosi Ben Shimon 2020-03-01 15:46:34 UTC
Description of problem:
A clean environment reprovision left the environment with two storage domains (iscsi_0 and iscsi_2) whose OVF disks are in ILLEGAL state.
This causes failures when trying to copy template disks from other storage domains, as well as when creating new disks.

New disk creation leaves the disk in LOCKED state and produces errors in the engine log.

For example (engine log):

2020-03-01 17:40:05,588+02 INFO  [org.ovirt.engine.core.bll.storage.disk.AddDiskCommand] (default task-35) [2117142c-fded-4059-816e-2f0f28cd17b5] Running command: AddDiskCommand internal: false. Entities affected :  ID: 74e5e8fa-29c2-483b-932c-84d769ae844b Type: StorageAction group CREATE_DISK with role type USER
2020-03-01 17:40:05,649+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.AddImageFromScratchCommand] (default task-35) [2117142c-fded-4059-816e-2f0f28cd17b5] Running command: AddImageFromScratchCommand internal: true. Entities affected :  ID: 74e5e8fa-29c2-483b-932c-84d769ae844b Type: Storage
2020-03-01 17:40:05,705+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateImageVDSCommand] (default task-35) [2117142c-fded-4059-816e-2f0f28cd17b5] START, CreateImageVDSCommand( CreateImageVDSCommandParameters:{storagePoolId='08ded4b0-79f0-45e5-bb7a-08997c737e06', ignoreFailoverLimit='false', storageDomainId='74e5e8fa-29c2-483b-932c-84d769ae844b', imageGroupId='ea382a8a-e29f-49ca-8048-e4cb65fb95f9', imageSizeInBytes='1073741824', volumeFormat='RAW', newImageId='29babeb6-fa10-4499-9ca0-9b5b3af45284', imageType='Preallocated', newImageDescription='{"DiskAlias":"new_disk","DiskDescription":""}', imageInitialSizeInBytes='0'}), log id: 328595ec
2020-03-01 17:40:05,707+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateImageVDSCommand] (default task-35) [2117142c-fded-4059-816e-2f0f28cd17b5] -- executeIrsBrokerCommand: calling 'createVolume' with two new parameters: description and UUID
2020-03-01 17:40:05,859+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateImageVDSCommand] (default task-35) [2117142c-fded-4059-816e-2f0f28cd17b5] FINISH, CreateImageVDSCommand, return: 29babeb6-fa10-4499-9ca0-9b5b3af45284, log id: 328595ec
2020-03-01 17:40:05,870+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (default task-35) [2117142c-fded-4059-816e-2f0f28cd17b5] CommandAsyncTask::Adding CommandMultiAsyncTasks object for command 'c028cef9-a42a-49c1-874c-1b491b8d2887'
2020-03-01 17:40:05,870+02 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (default task-35) [2117142c-fded-4059-816e-2f0f28cd17b5] CommandMultiAsyncTasks::attachTask: Attaching task '60b0389f-004a-4bd7-aa0c-9094e1ee8870' to command 'c028cef9-a42a-49c1-874c-1b491b8d2887'.
2020-03-01 17:40:05,910+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (default task-35) [2117142c-fded-4059-816e-2f0f28cd17b5] Adding task '60b0389f-004a-4bd7-aa0c-9094e1ee8870' (Parent Command 'AddImageFromScratch', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters'), polling hasn't started yet..
2020-03-01 17:40:05,938+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (default task-35) [2117142c-fded-4059-816e-2f0f28cd17b5] BaseAsyncTask::startPollingTask: Starting to poll task '60b0389f-004a-4bd7-aa0c-9094e1ee8870'.
2020-03-01 17:40:05,984+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-35) [2117142c-fded-4059-816e-2f0f28cd17b5] EVENT_ID: USER_ADD_DISK(2,020), Add-Disk operation of 'new_disk' was initiated by admin@internal-authz.
2020-03-01 17:40:06,536+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-2) [2117142c-fded-4059-816e-2f0f28cd17b5] Command 'AddDisk' (id: 'c4bc804f-05c4-4358-8c75-0b8e773b2a37') waiting on child command id: 'c028cef9-a42a-49c1-874c-1b491b8d2887' type:'AddImageFromScratch' to complete
2020-03-01 17:40:07,574+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-2) [] Polling and updating Async Tasks: 1 tasks, 1 tasks to poll now
2020-03-01 17:40:07,595+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-2) [] Failed in 'HSMGetAllTasksStatusesVDS' method
2020-03-01 17:40:07,625+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-2) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command HSMGetAllTasksStatusesVDS failed: value=Could not acquire resource. Probably resource factory threw an exception.: () abortedcode=100
2020-03-01 17:40:07,626+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-2) [] SPMAsyncTask::PollTask: Polling task '60b0389f-004a-4bd7-aa0c-9094e1ee8870' (Parent Command 'AddImageFromScratch', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'cleanSuccess'.
2020-03-01 17:40:07,665+02 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-2) [] BaseAsyncTask::logEndTaskFailure: Task '60b0389f-004a-4bd7-aa0c-9094e1ee8870' (Parent Command 'AddImageFromScratch', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = value=Could not acquire resource. Probably resource factory threw an exception.: () abortedcode=100, code = 100',
-- Exception: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = value=Could not acquire resource. Probably resource factory threw an exception.: () abortedcode=100, code = 100'
2020-03-01 17:40:07,718+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.AddImageFromScratchCommand] (EE-ManagedThreadFactory-engine-Thread-3157) [2117142c-fded-4059-816e-2f0f28cd17b5] Command [id=c028cef9-a42a-49c1-874c-1b491b8d2887]: Updating status to 'FAILED', The command end method logic will be executed by one of its parent commands.
2020-03-01 17:40:07,719+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-3157) [2117142c-fded-4059-816e-2f0f28cd17b5] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'AddImageFromScratch' completed, handling the result.
2020-03-01 17:40:07,719+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-3157) [2117142c-fded-4059-816e-2f0f28cd17b5] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'AddImageFromScratch' succeeded, clearing tasks.
2020-03-01 17:40:07,720+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-3157) [2117142c-fded-4059-816e-2f0f28cd17b5] SPMAsyncTask::ClearAsyncTask: Attempting to clear task '60b0389f-004a-4bd7-aa0c-9094e1ee8870'
2020-03-01 17:40:07,721+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-3157) [2117142c-fded-4059-816e-2f0f28cd17b5] START, SPMClearTaskVDSCommand( SPMTaskGuidBaseVDSCommandParameters:{storagePoolId='08ded4b0-79f0-45e5-bb7a-08997c737e06', ignoreFailoverLimit='false', taskId='60b0389f-004a-4bd7-aa0c-9094e1ee8870'}), log id: 42e54a11
2020-03-01 17:40:07,722+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-3157) [2117142c-fded-4059-816e-2f0f28cd17b5] START, HSMClearTaskVDSCommand(HostName = host_mixed_3, HSMTaskGuidBaseVDSCommandParameters:{hostId='03869029-fc9d-4ff8-80b6-9504306bad1c', taskId='60b0389f-004a-4bd7-aa0c-9094e1ee8870'}), log id: 32f5c5c3
2020-03-01 17:40:07,740+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-3157) [2117142c-fded-4059-816e-2f0f28cd17b5] FINISH, HSMClearTaskVDSCommand, return: , log id: 32f5c5c3
2020-03-01 17:40:07,741+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-3157) [2117142c-fded-4059-816e-2f0f28cd17b5] FINISH, SPMClearTaskVDSCommand, return: , log id: 42e54a11
2020-03-01 17:40:07,748+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-3157) [2117142c-fded-4059-816e-2f0f28cd17b5] BaseAsyncTask::removeTaskFromDB: Removed task '60b0389f-004a-4bd7-aa0c-9094e1ee8870' from DataBase
2020-03-01 17:40:07,749+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-3157) [2117142c-fded-4059-816e-2f0f28cd17b5] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity 'c028cef9-a42a-49c1-874c-1b491b8d2887'
2020-03-01 17:40:08,541+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-96) [2117142c-fded-4059-816e-2f0f28cd17b5] Command 'AddDisk' id: 'c4bc804f-05c4-4358-8c75-0b8e773b2a37' child commands '[c028cef9-a42a-49c1-874c-1b491b8d2887]' executions were completed, status 'FAILED'
2020-03-01 17:40:09,561+02 ERROR [org.ovirt.engine.core.bll.storage.disk.AddDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-20) [2117142c-fded-4059-816e-2f0f28cd17b5] Ending command 'org.ovirt.engine.core.bll.storage.disk.AddDiskCommand' with failure.
2020-03-01 17:40:09,571+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.AddImageFromScratchCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-20) [2117142c-fded-4059-816e-2f0f28cd17b5] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.AddImageFromScratchCommand' with failure.


Version-Release number of selected component (if applicable):
ovirt-engine-4.4.0-0.24.master.el8ev.noarch

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Two OVF disks, in storage domains iscsi_0 and iscsi_2, are in ILLEGAL state.

Expected results:
The OVF disks should be in OK state and the storage domains should be functional.

Additional info:

Comment 1 Benny Zlotnik 2020-03-01 15:57:36 UTC
What is the exact flow?
Does this reproduce every time?
Are there debug level vdsm logs?

Comment 4 Nir Soffer 2020-03-01 23:06:08 UTC
Yosi, without vdsm logs we have no choice but to close this bug.

Comment 5 Ilan Zuckerman 2020-03-03 07:24:47 UTC
(In reply to Nir Soffer from comment #4)
> Yosi, without vdsm logs we have no choice but to close this bug.

This issue occurred on other environments as well.
Attaching a link to the log-collector logs from one such environment with ILLEGAL-status OVFs:
https://drive.google.com/file/d/1N0XX8i9CoHOwk2fWA95ZD4Kta0Yqpuiz/view?usp=sharing

Comment 6 Avihai 2020-03-04 14:16:15 UTC
Adding a VDSM log printout of this issue, seen on another environment (storage-ge-01) that was rebuilt to the same build and hit the same issue with the same scenario.

VDSM logs:
This is seen 8 minutes before the issue occurs:
2020-03-01 16:36:35,204+0200 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.setMOMPolicyParameters succeeded in 0.14 seconds (__init__:312)
2020-03-01 16:36:35,253+0200 INFO  (jsonrpc/7) [api.host] START hostdevListByCaps(caps=None) from=::ffff:10.35.161.181,57314, flow_id=7623ad29 (api:48)
2020-03-01 16:36:35,369+0200 ERROR (jsonrpc/7) [root] Could not read system udev path -> block path mapping: a bytes-like object is required, not 'str' (hostdev:513)

This is correlated to the issue:

2020-03-01 16:44:33,607+0200 WARN  (tasks/6) [storage.ResourceManager] Resource factory failed to create resource '01_img_6f5f35ea-cb71-45c1-b96d-5606fc6d65b7.62e8fe37-cf9a-45f2-8458-c1dc5259a2ec'. Canceling request. (resourceManager:524)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 520, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/image.py", line 176, in getChain
    uuidlist = volclass.getImageVolumes(sdUUID, imgUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1476, in getImageVolumes
    return cls.manifestClass.getImageVolumes(sdUUID, imgUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line 316, in getImageVolumes
    lvs = lvm.lvsByTag(sdUUID, "%s%s" % (sc.TAG_PREFIX_IMAGE, imgUUID))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1667, in lvsByTag
    return [lv for lv in getLV(vgName) if tag in lv.tags]
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1121, in getLV
    raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
vdsm.storage.exception.LogicalVolumeDoesNotExistError: Logical volume does not exist: ('6f5f35ea-cb71-45c1-b96d-5606fc6d65b7/None',)
2020-03-01 16:44:33,608+0200 WARN  (tasks/6) [storage.ResourceManager.Request] (ResName='01_img_6f5f35ea-cb71-45c1-b96d-5606fc6d65b7.62e8fe37-cf9a-45f2-8458-c1dc5259a2ec', ReqID='1bde6903-84cf-43b1-9773-b682b16a6477') Tried to cancel a processed request (resourceManager:188)
2020-03-01 16:44:33,611+0200 ERROR (tasks/6) [storage.TaskManager.Task] (Task='17a3b5fa-10db-43ae-a5c1-fbcafde94322') Unexpected error (task:874)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 448, in acquireResource
    request.cancel()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 189, in cancel
    raise RequestAlreadyProcessedError("Cannot cancel a processed "
vdsm.storage.resourceManager.RequestAlreadyProcessedError: Cannot cancel a processed request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 881, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 344, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1869, in createVolume
    with rm.acquireResource(img_ns, imgUUID, rm.EXCLUSIVE):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 829, in acquireResource
    return _manager.acquireResource(namespace, name, lockType, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 456, in acquireResource
    raise se.ResourceAcqusitionFailed()
vdsm.storage.exception.ResourceAcqusitionFailed: Could not acquire resource. Probably resource factory threw an exception.: ()

Engine:
2020-03-01 16:44:35,344+02 INFO  [org.ovirt.engine.core.bll.storage.domain.ActivateStorageDomainCommand] (default task-13) [2418f982] ActivateStorage Domain. After Connect all hosts to pool. Time: Sun Mar 01 16
:44:35 IST 2020
2020-03-01 16:44:35,375+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [] Failed in 'HSMGetAllTasksStatusesVDS' method
2020-03-01 16:44:35,398+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command HSMGetAllTasksStatusesVDS failed: value=Could not acquire resource. Probably resource factory threw an exception.: () abortedcode=100
2020-03-01 16:44:35,400+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [] SPMAsyncTask::PollTask: Polling task '17a3b5fa-10db-43ae-a5c1-fbcafde94322' (Parent Command 'AddImageFromScratch', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'cleanSuccess'.
2020-03-01 16:44:35,404+02 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [] BaseAsyncTask::logEndTaskFailure: Task '17a3b5fa-10db-43ae-a5c1-fbcafde94322' (Parent Command 'AddImageFromScratch', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = value=Could not acquire resource. Probably resource factory threw an exception.: () abortedcode=100, code = 100',
-- Exception: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = value=Could not acquire resource. Probably resource factory threw an exception.: () abortedcode=100, code = 100'

Comment 7 Avihai 2020-03-04 14:17:18 UTC
Created attachment 1667502 [details]
2nd_rep_storage-ge-01_env

Comment 8 Nir Soffer 2020-03-04 14:57:44 UTC
(In reply to Avihai from comment #6)
> This is seen 8 minutes before the issue occurs:
> 2020-03-01 16:36:35,204+0200 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC
> call Host.setMOMPolicyParameters succeeded in 0.14 seconds (__init__:312)
> 2020-03-01 16:36:35,253+0200 INFO  (jsonrpc/7) [api.host] START
> hostdevListByCaps(caps=None) from=::ffff:10.35.161.181,57314,
> flow_id=7623ad29 (api:48)
> 2020-03-01 16:36:35,369+0200 ERROR (jsonrpc/7) [root] Could not read system
> udev path -> block path mapping: a bytes-like object is required, not 'str'
> (hostdev:513)

This is a virt bug. Milan, can you check this?

Comment 9 Nir Soffer 2020-03-04 15:01:13 UTC
(In reply to Avihai from comment #6)
> Adding VDSM logs printout from this issue seen on another ENV(storage-ge-01)
> that was rebuild to same build and saw the same issue with the same scenario.
...
> 2020-03-01 16:44:33,607+0200 WARN  (tasks/6) [storage.ResourceManager]
> Resource factory failed to create resource
> '01_img_6f5f35ea-cb71-45c1-b96d-5606fc6d65b7.62e8fe37-cf9a-45f2-8458-
> c1dc5259a2ec'. Canceling request. (resourceManager:524)
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py",
> line 520, in registerResource
>     obj = namespaceObj.factory.createResource(name, lockType)
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py",
> line 193, in createResource
>     lockType)
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py",
> line 122, in __getResourceCandidatesList
>     imgUUID=resourceName)
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/image.py", line 176,
> in getChain
>     uuidlist = volclass.getImageVolumes(sdUUID, imgUUID)
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1476,
> in getImageVolumes
>     return cls.manifestClass.getImageVolumes(sdUUID, imgUUID)
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line
> 316, in getImageVolumes
>     lvs = lvm.lvsByTag(sdUUID, "%s%s" % (sc.TAG_PREFIX_IMAGE, imgUUID))
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1667, in
> lvsByTag
>     return [lv for lv in getLV(vgName) if tag in lv.tags]
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1121, in
> getLV
>     raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
> vdsm.storage.exception.LogicalVolumeDoesNotExistError: Logical volume does
> not exist: ('6f5f35ea-cb71-45c1-b96d-5606fc6d65b7/None',)

This looks like a bug in lvm.getLV. When getLV() is called without specifying an lv name, it should return a list of all lvs, but here the code treated the default value (None) as an lv name.

It may be related to recent changes in the lvm module.

Amit, can you check this?

Comment 10 Benny Zlotnik 2020-03-04 15:04:48 UTC
(In reply to Nir Soffer from comment #9)
> (In reply to Avihai from comment #6)
> > Adding VDSM logs printout from this issue seen on another ENV(storage-ge-01)
> > that was rebuild to same build and saw the same issue with the same scenario.
> ...
> > 2020-03-01 16:44:33,607+0200 WARN  (tasks/6) [storage.ResourceManager]
> > Resource factory failed to create resource
> > '01_img_6f5f35ea-cb71-45c1-b96d-5606fc6d65b7.62e8fe37-cf9a-45f2-8458-
> > c1dc5259a2ec'. Canceling request. (resourceManager:524)
> > Traceback (most recent call last):
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py",
> > line 520, in registerResource
> >     obj = namespaceObj.factory.createResource(name, lockType)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py",
> > line 193, in createResource
> >     lockType)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py",
> > line 122, in __getResourceCandidatesList
> >     imgUUID=resourceName)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/image.py", line 176,
> > in getChain
> >     uuidlist = volclass.getImageVolumes(sdUUID, imgUUID)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1476,
> > in getImageVolumes
> >     return cls.manifestClass.getImageVolumes(sdUUID, imgUUID)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line
> > 316, in getImageVolumes
> >     lvs = lvm.lvsByTag(sdUUID, "%s%s" % (sc.TAG_PREFIX_IMAGE, imgUUID))
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1667, in
> > lvsByTag
> >     return [lv for lv in getLV(vgName) if tag in lv.tags]
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1121, in
> > getLV
> >     raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
> > vdsm.storage.exception.LogicalVolumeDoesNotExistError: Logical volume does
> > not exist: ('6f5f35ea-cb71-45c1-b96d-5606fc6d65b7/None',)
> 
> This looks like a bug in lvm.getLV. When getLV() is called without specifying
> lv names, it should return list of all lvs, but here the code treated the
> default
> value (None) as an lv name.
> 
> It may be related to recent changes in lvm module.
> 
> Amit, can you check this?

I think there is an issue with devices, adding my reply from the mail thread:

These are the relevant errors:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 520, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/image.py", line 176, in getChain
    uuidlist = volclass.getImageVolumes(sdUUID, imgUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1476, in getImageVolumes
    return cls.manifestClass.getImageVolumes(sdUUID, imgUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line 316, in getImageVolumes
    lvs = lvm.lvsByTag(sdUUID, "%s%s" % (sc.TAG_PREFIX_IMAGE, imgUUID))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1667, in lvsByTag
    return [lv for lv in getLV(vgName) if tag in lv.tags]
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1121, in getLV
    raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
vdsm.storage.exception.LogicalVolumeDoesNotExistError: Logical volume does not exist: ('6f5f35ea-cb71-45c1-b96d-5606fc6d65b7/None',)

Based on reading the code, this means `lvs` didn't return anything vdsm could parse?
If I run lvs/vgs manually there are warnings:
$ lvs
  WARNING: Not using device /dev/sdf for PV XQ2Ea4-5YJV-xLgV-b69w-k1o3-6DiK-JQ7Wua.
  WARNING: Not using device /dev/mapper/360002ac0000000000000001500021f6b for PV ZqW5kw-Uctu-LnIH-f7jT-1CeU-gXgG-e4dsRg.
  WARNING: PV XQ2Ea4-5YJV-xLgV-b69w-k1o3-6DiK-JQ7Wua prefers device /dev/mapper/360002ac0000000000000001000021f6b because device name matches previous.
  WARNING: PV ZqW5kw-Uctu-LnIH-f7jT-1CeU-gXgG-e4dsRg prefers device /dev/sde because device is used by LV.

360002ac0000000000000001000021f6b is the one used by the VG 6f5f35ea-cb71-45c1-b96d-5606fc6d65b7

If I manually try to create an LV:
$ lvcreate -L 100M -n foo 6f5f35ea-cb71-45c1-b96d-5606fc6d65b7\
  Cannot update volume group 6f5f35ea-cb71-45c1-b96d-5606fc6d65b7 with duplicate PV devices.

Refreshing the iscsi connection fixes the issue (same as reactivating the host), which would mean the host might be seeing stale devices
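
To make the stale-device condition described above easier to spot on an affected host, here is a minimal diagnostic sketch (our illustration, not vdsm code): it runs a plain `lvs` and flags the duplicate-PV / "Not using device" warnings quoted above. It assumes, as in the manual run above, that LVM emits these warnings on stderr; the helper name and warning patterns are ours.

import subprocess

SUSPECT_PATTERNS = (
    "WARNING: Not using device",
    "prefers device",
    "duplicate PV",
)

def check_duplicate_pv_warnings():
    # Run a plain `lvs`; the report itself goes to stdout, warnings to stderr.
    result = subprocess.run(
        ["lvs", "--noheadings", "-o", "vg_name,lv_name"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Keep only the stderr lines matching the warnings seen in this comment.
    return [
        line.strip()
        for line in result.stderr.splitlines()
        if any(pattern in line for pattern in SUSPECT_PATTERNS)
    ]

if __name__ == "__main__":
    for warning in check_duplicate_pv_warnings():
        print(warning)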

Comment 11 Nir Soffer 2020-03-04 15:58:32 UTC
(In reply to Benny Zlotnik from comment #10)
> (In reply to Nir Soffer from comment #9)
...
> > >     raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
> > > vdsm.storage.exception.LogicalVolumeDoesNotExistError: Logical volume does
> > > not exist: ('6f5f35ea-cb71-45c1-b96d-5606fc6d65b7/None',)
> > 
> > This looks like a bug in lvm.getLV. When getLV() is called without specifying
> > lv names, it should return list of all lvs, but here the code treated the
> > default
> > value (None) as an lv name.
> > 
> > It may be related to recent changes in lvm module.
> > 
> > Amit, can you check this?
> 
> I think there is an issue with devices, adding my reply from the mail thread:
...
> 
> Based on reading the code, this means `lvs` didn't return anything vdsm
> could parse?

Possible. Looks like the error message is bad:

1117 def getLV(vgName, lvName=None):
1118     lv = _lvminfo.getLv(vgName, lvName)
1119     # getLV() should not return None
1120     if not lv:
1121         raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
1122     else:
1123         return lv

The error message is relevant only when lvName != None. The real issue is hiding the error
in LVMCache.getLv() and returning a bad value, which is then converted to an unhelpful exception
in lvm.getLV().

To understand this error we need to read vdsm logs before this call.
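
To make the point about the misleading message concrete, here is a small self-contained toy model of the error path (not the real vdsm code or a proposed fix); the cache is a plain dict and the exception class is a local stand-in:

class LogicalVolumeDoesNotExistError(Exception):
    pass

# vg name -> {lv name -> lv record}; an empty dict models a cache that has no
# (non-stale) LVs for a VG, e.g. after a failed or empty reload.
_lv_cache = {
    "6f5f35ea-cb71-45c1-b96d-5606fc6d65b7": {},
}

def getLV(vgName, lvName=None):
    vg_lvs = _lv_cache.get(vgName, {})
    if lvName is None:
        lvs = list(vg_lvs.values())
        if not lvs:
            # The branch hit in the traceback above: all LVs of the VG were
            # requested, nothing came back, and it is reported as "vg/None".
            raise LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
        return lvs
    if lvName not in vg_lvs:
        raise LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
    return vg_lvs[lvName]

try:
    getLV("6f5f35ea-cb71-45c1-b96d-5606fc6d65b7")
except LogicalVolumeDoesNotExistError as e:
    # Prints ".../None" although no specific LV name was ever requested.
    print("misleading error:", e)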

> If I run lvs/vgs manually there are warnings:
> $ lvs
>   WARNING: Not using device /dev/sdf for PV
> XQ2Ea4-5YJV-xLgV-b69w-k1o3-6DiK-JQ7Wua.
>   WARNING: Not using device /dev/mapper/360002ac0000000000000001500021f6b
> for PV ZqW5kw-Uctu-LnIH-f7jT-1CeU-gXgG-e4dsRg.
>   WARNING: PV XQ2Ea4-5YJV-xLgV-b69w-k1o3-6DiK-JQ7Wua prefers device
> /dev/mapper/360002ac0000000000000001000021f6b because device name matches
> previous.
>   WARNING: PV ZqW5kw-Uctu-LnIH-f7jT-1CeU-gXgG-e4dsRg prefers device /dev/sde
> because device is used by LV. 
> 360002ac0000000000000001000021f6b is the one used by the VG
> 6f5f35ea-cb71-45c1-b96d-5606fc6d65b7
> 
> If I manually try to create an LV:
> $ lvcreate -L 100M -n foo 6f5f35ea-cb71-45c1-b96d-5606fc6d65b7\
>   Cannot update volume group 6f5f35ea-cb71-45c1-b96d-5606fc6d65b7 with
> duplicate PV devices.
> 
> Refreshing the iscsi connection fixes the issue (same as reactivating the
> host), which would mean the host might be seeing stale devices

Is this issue reproducible?

Comment 12 Amit Bawer 2020-03-04 21:30:58 UTC
(In reply to Nir Soffer from comment #9)
> (In reply to Avihai from comment #6)
> > Adding VDSM logs printout from this issue seen on another ENV(storage-ge-01)
> > that was rebuild to same build and saw the same issue with the same scenario.
> ...
> > 2020-03-01 16:44:33,607+0200 WARN  (tasks/6) [storage.ResourceManager]
> > Resource factory failed to create resource
> > '01_img_6f5f35ea-cb71-45c1-b96d-5606fc6d65b7.62e8fe37-cf9a-45f2-8458-
> > c1dc5259a2ec'. Canceling request. (resourceManager:524)
> > Traceback (most recent call last):
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py",
> > line 520, in registerResource
> >     obj = namespaceObj.factory.createResource(name, lockType)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py",
> > line 193, in createResource
> >     lockType)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py",
> > line 122, in __getResourceCandidatesList
> >     imgUUID=resourceName)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/image.py", line 176,
> > in getChain
> >     uuidlist = volclass.getImageVolumes(sdUUID, imgUUID)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1476,
> > in getImageVolumes
> >     return cls.manifestClass.getImageVolumes(sdUUID, imgUUID)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line
> > 316, in getImageVolumes
> >     lvs = lvm.lvsByTag(sdUUID, "%s%s" % (sc.TAG_PREFIX_IMAGE, imgUUID))
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1667, in
> > lvsByTag
> >     return [lv for lv in getLV(vgName) if tag in lv.tags]
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1121, in
> > getLV
> >     raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
> > vdsm.storage.exception.LogicalVolumeDoesNotExistError: Logical volume does
> > not exist: ('6f5f35ea-cb71-45c1-b96d-5606fc6d65b7/None',)
> 
> This looks like a bug in lvm.getLV. When getLV() is called without specifying
> lv names, it should return list of all lvs, but here the code treated the
> default
> value (None) as an lv name.
> 
> It may be related to recent changes in lvm module.
> 
> Amit, can you check this?

The backtrace shows that lvsByTag() asked for all lvs on vg=6f5f35ea-cb71-45c1-b96d-5606fc6d65b7, with lvName=None passed to lvm.getLV(...).
lvm.getLV(6f5f35ea-cb71-45c1-b96d-5606fc6d65b7, None) invokes LVMCache.getLv(6f5f35ea-cb71-45c1-b96d-5606fc6d65b7, None), which should return a list of the non-stale lvs belonging to this vg. The lv list came back empty, so lv == [] in lvm.getLV:1120, already cited in comment #11, and the exception is raised for the non-existing vgName/lvName parameters (where lvName was given as None in this flow).

It is hard to say from the given log what caused LVMCache to return an empty list of lvs for that vg; if any lvs command used to update the cache had failed, we should have seen a WARN message about it in the log (there are WARN messages about reloading VGs, but not for LVs). Maybe the pv device errors/mixups in this env caused the 'lvs' command to return an okay status for this VG but with no actual LVs, in which case getting an empty list from LVMCache.getLv(...) is expected, since the cache itself is stale and there are no non-stale lvs to update from.

Comment 13 Benny Zlotnik 2020-03-05 07:24:21 UTC
(In reply to Nir Soffer from comment #11)
> (In reply to Benny Zlotnik from comment #10)
> > (In reply to Nir Soffer from comment #9)
> ...
> > > >     raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
> > > > vdsm.storage.exception.LogicalVolumeDoesNotExistError: Logical volume does
> > > > not exist: ('6f5f35ea-cb71-45c1-b96d-5606fc6d65b7/None',)
> > > 
> > > This looks like a bug in lvm.getLV. When getLV() is called without specifying
> > > lv names, it should return list of all lvs, but here the code treated the
> > > default
> > > value (None) as an lv name.
> > > 
> > > It may be related to recent changes in lvm module.
> > > 
> > > Amit, can you check this?
> > 
> > I think there is an issue with devices, adding my reply from the mail thread:
> ...
> > 
> > Based on reading the code, this means `lvs` didn't return anything vdsm
> > could parse?
> 
> Possible. Looks like the error message is bad:
> 
> 1117 def getLV(vgName, lvName=None):
> 1118     lv = _lvminfo.getLv(vgName, lvName)
> 1119     # getLV() should not return None
> 1120     if not lv:
> 1121         raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName,
> lvName))
> 1122     else:
> 1123         return lv
> 
> The error is relevant only when lvName != None. The real issue is hiding the
> error
> in LVMCache.getLv() and returning bad value, converted to unhelpful exception
> in lvm.getLV().
> 
> To understand this error we need to read vdsm logs before this call.
> 
> > If I run lvs/vgs manually there are warnings:
> > $ lvs
> >   WARNING: Not using device /dev/sdf for PV
> > XQ2Ea4-5YJV-xLgV-b69w-k1o3-6DiK-JQ7Wua.
> >   WARNING: Not using device /dev/mapper/360002ac0000000000000001500021f6b
> > for PV ZqW5kw-Uctu-LnIH-f7jT-1CeU-gXgG-e4dsRg.
> >   WARNING: PV XQ2Ea4-5YJV-xLgV-b69w-k1o3-6DiK-JQ7Wua prefers device
> > /dev/mapper/360002ac0000000000000001000021f6b because device name matches
> > previous.
> >   WARNING: PV ZqW5kw-Uctu-LnIH-f7jT-1CeU-gXgG-e4dsRg prefers device /dev/sde
> > because device is used by LV. 
> > 360002ac0000000000000001000021f6b is the one used by the VG
> > 6f5f35ea-cb71-45c1-b96d-5606fc6d65b7
> > 
> > If I manually try to create an LV:
> > $ lvcreate -L 100M -n foo 6f5f35ea-cb71-45c1-b96d-5606fc6d65b7\
> >   Cannot update volume group 6f5f35ea-cb71-45c1-b96d-5606fc6d65b7 with
> > duplicate PV devices.
> > 
> > Refreshing the iscsi connection fixes the issue (same as reactivating the
> > host), which would mean the host might be seeing stale devices
> 
> Is this issue reproducible?

not manually

Comment 14 Milan Zamazal 2020-03-05 08:34:16 UTC
(In reply to Nir Soffer from comment #8)
> (In reply to Avihai from comment #6)
> > This is seen 8 minutes before the issue occurs:
> > 2020-03-01 16:36:35,204+0200 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC
> > call Host.setMOMPolicyParameters succeeded in 0.14 seconds (__init__:312)
> > 2020-03-01 16:36:35,253+0200 INFO  (jsonrpc/7) [api.host] START
> > hostdevListByCaps(caps=None) from=::ffff:10.35.161.181,57314,
> > flow_id=7623ad29 (api:48)
> > 2020-03-01 16:36:35,369+0200 ERROR (jsonrpc/7) [root] Could not read system
> > udev path -> block path mapping: a bytes-like object is required, not 'str'
> > (hostdev:513)
> 
> This is virt bug. Milan, can you check this?

Yes, I'll fix it.

Comment 19 Martin Tessun 2020-03-10 12:59:40 UTC
What is of interest here is the device mapping during the failure. If re-logging in to the iSCSI target gets rid of it, we seem to have an issue either with multipathd not recognizing all the paths, with access to the device bypassing the multipath device, or with some stale LVM caches.

So gathering an sosreport while the error is present might help in understanding what the devices look like at that stage.

Given the above error, it looks to me as if LVM is using only one path in a multipath environment, thus producing the "duplicate PV" error message.
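
As a starting point for the data gathering suggested above, here is a minimal sketch (our suggestion, not an oVirt/RHV tool) that captures the device view while the error is present: multipath topology, LVM's view of PVs and LV devices, and the iSCSI sessions. All commands used are standard (multipath, pvs, lvs, iscsiadm); the output path is arbitrary.

import subprocess

COMMANDS = [
    ["multipath", "-ll"],
    ["pvs", "-o", "+pv_uuid"],
    ["lvs", "-o", "+devices"],
    ["iscsiadm", "-m", "session"],
]

def capture_device_state(output_path="/tmp/device-state.txt"):
    with open(output_path, "w") as out:
        for cmd in COMMANDS:
            out.write("### %s\n" % " ".join(cmd))
            # Merge stderr into stdout so LVM/multipath warnings are kept too.
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
            )
            out.write(result.stdout)
            out.write("\n")
    return output_path

if __name__ == "__main__":
    print("device state written to", capture_device_state())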

Comment 21 RHEL Program Management 2020-03-13 12:24:13 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 24 Avihai 2020-03-23 08:24:58 UTC
Seen again, with LVM filters configured on all hosts and the same scenario.
Logs attached.

ovirt-engine-4.4.0-0.26.master.el8ev.noarch
vdsm-4.40.7-1.el8ev.x86_64

LVM filters are configured on all hosts:
[root@lynx24 vdsm]# vdsm-tool config-lvm-filter
Analyzing host...
LVM filter is already configured for Vdsm

Events:
Mar 21, 2020, 9:42:54 PM User admin@internal-authz is copying disk latest-rhel-guest-image-7.7-infra to domain iscsi_0.
Mar 21, 2020, 9:43:00 PM VDSM host_mixed_1 command HSMGetAllTasksStatusesVDS failed: value=Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\'  Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']' abortedcode=550



Copy to iSCSI SD:
Engine:
2020-03-21 21:43:00,323+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] Polling and updating Async Tasks: 2 tasks, 1 tasks to poll now
2020-03-21 21:43:00,328+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] Failed in 'HSMGetAllTasksStatusesVDS' method
2020-03-21 21:43:00,332+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command HSMGetAllTasksStatusesVDS failed: value=Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\'  Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']' abortedcode=550
2020-03-21 21:43:00,333+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] SPMAsyncTask::PollTask: Polling task 'c7d3ec52-156c-4077-b9da-bfbedc1278b1' (Parent Command 'CreateVolumeContainer', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'cleanSuccess'.
2020-03-21 21:43:00,335+02 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] BaseAsyncTask::logEndTaskFailure: Task 'c7d3ec52-156c-4077-b9da-bfbedc1278b1' (Parent Command 'CreateVolumeContainer', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = value=Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\'  Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']' abortedcode=550, code = 550',
-- Exception: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = value=Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\'  Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']' abortedcode=550, code = 550'
2020-03-21 21:43:00,336+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] CommandAsyncTask::endActionIfNecessary: All tasks of command 'e2f495fc-9c21-47fa-b035-24fa39f689ea' has ended -> executing 'endAction'
2020-03-21 21:43:00,336+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] CommandAsyncTask::endAction: Ending action for '1' tasks (command ID: 'e2f495fc-9c21-47fa-b035-24fa39f689ea'): calling endAction '.
2020-03-21 21:43:00,337+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-13224) [] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'CreateVolumeContainer',
2020-03-21 21:43:00,341+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CreateVolumeContainerCommand] (EE-ManagedThreadFactory-engine-Thread-13224) [a4d7bedb-0782-4324-be0e-a3b13e4b37bc] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CreateVolumeContainerCommand' with failure.

VDSM log:
2020-03-21 21:42:57,373+0200 ERROR (tasks/3) [storage.Volume] Failed to create volume /rhev/data-center/mnt/blockSD/2eab638c-d8aa-4a2d-9710-aaaa9b64c410/images/a5d54a36-40da-46c2-ac35-71c714e0a633/d1fee325-96ff
-40fb-93ec-49245f3c1791: Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\'  Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" alr
eady exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']' (volume:1166)
2020-03-21 21:42:57,373+0200 ERROR (tasks/3) [storage.Volume] Unexpected error (volume:1202)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1163, in create
    initial_size=initial_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line 503, in _create
    initialTags=(sc.TAG_VOL_UNINIT,))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1406, in createLV
    raise se.CannotCreateLogicalVolume(vgName, lvName, err)
vdsm.storage.exception.CannotCreateLogicalVolume: Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\'  Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']'
2020-03-21 21:42:57,374+0200 ERROR (tasks/3) [storage.TaskManager.Task] (Task='c7d3ec52-156c-4077-b9da-bfbedc1278b1') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1874, in createVolume
    initial_size=initialSize)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 971, in createVolume
    initial_size=initial_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1163, in create
    initial_size=initial_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line 503, in _create
    initialTags=(sc.TAG_VOL_UNINIT,))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1406, in createLV
    raise se.CannotCreateLogicalVolume(vgName, lvName, err)
vdsm.storage.exception.CannotCreateLogicalVolume: Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\'  Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']'

Comment 25 Avihai 2020-03-23 08:26:13 UTC
Created attachment 1672570 [details]
rhv-4.4.0-26 reproduction engine/vdsm logs

Comment 26 Avihai 2020-03-24 08:41:00 UTC
Just to clarify: after the issue is seen on a storage domain, creating a new qcow floating disk fails forever on that storage domain.
This reproduces 100%, also via webadmin.

Comment 27 Amit Bawer 2020-03-31 09:48:19 UTC
(In reply to Avihai from comment #26)
> Just to clarify after the issue is seen in at a storage domain creating a
> new qcow floating disk fails forever on that storage domain.
> This reproduce 100% also via webadmin.

What vdsm version is currently in use? No version appears in the vdsm.log itself from comment #25;
it should be later than 4.40.8.

Comment 28 Amit Bawer 2020-03-31 10:47:09 UTC
OTOH, I see an exception for an already existing file volume during creation in the log from comment #25,
which probably does not relate to the fixes in 4.40.8 and later.


2020-03-21 21:42:06,696+0200 INFO  (tasks/4) [storage.Volume] Creating volume d1fee325-96ff-40fb-93ec-49245f3c1791 (volume:1142)
2020-03-21 21:42:06,735+0200 ERROR (tasks/4) [storage.Volume] Failed to create volume /rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge__6__nfs__3/16483589-b481-49ed-8f06-596df6330afb/images/a5d54a36-40da-46c2-ac35-71c714e0a633/d1fee325-96ff-40fb-93ec-49245f3c1791: Volume already exists: ('d1fee325-96ff-40fb-93ec-49245f3c1791',) (volume:1166)
2020-03-21 21:42:06,735+0200 ERROR (tasks/4) [storage.Volume] Unexpected error (volume:1202)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 542, in _truncate_volume
    creatExcl=True)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 347, in truncateFile
    ioproc.truncate(path, size, mode if mode is not None else 0, creatExcl)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 576, in truncate
    self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 448, in _sendCommand
    raise OSError(errcode, errstr)
FileExistsError: [Errno 17] File exists

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1163, in create
    initial_size=initial_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 457, in _create
    imgUUID, srcImgUUID, srcVolUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 513, in _create_cow_volume
    cls._truncate_volume(vol_path, 0, vol_id, dom)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 545, in _truncate_volume
    raise se.VolumeAlreadyExists(vol_id)
vdsm.storage.exception.VolumeAlreadyExists: Volume already exists: ('d1fee325-96ff-40fb-93ec-49245f3c1791',)
2020-03-21 21:42:06,735+0200 ERROR (tasks/4) [storage.TaskManager.Task] (Task='0840b2d9-9f0c-4e92-ab10-a89befbcc899') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 542, in _truncate_volume
    creatExcl=True)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 347, in truncateFile
    ioproc.truncate(path, size, mode if mode is not None else 0, creatExcl)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 576, in truncate
    self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 448, in _sendCommand
    raise OSError(errcode, errstr)
FileExistsError: [Errno 17] File exists

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1874, in createVolume
    initial_size=initialSize)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 971, in createVolume
    initial_size=initial_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1163, in create
    initial_size=initial_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 457, in _create
    imgUUID, srcImgUUID, srcVolUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 513, in _create_cow_volume
    cls._truncate_volume(vol_path, 0, vol_id, dom)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 545, in _truncate_volume
    raise se.VolumeAlreadyExists(vol_id)
vdsm.storage.exception.VolumeAlreadyExists: Volume already exists: ('d1fee325-96ff-40fb-93ec-49245f3c1791',)
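
The traceback above boils down to an exclusive create hitting a leftover volume file. For readers less familiar with that pattern, here is a small self-contained illustration (not vdsm code; the helper, exception, and path are ours) of how an O_EXCL create turns a stale file into a VolumeAlreadyExists-style error:

import os
import tempfile

class VolumeAlreadyExists(Exception):
    pass

def truncate_volume_exclusive(path, size=0):
    """Create `path` exclusively and truncate it to `size` bytes."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o660)
    except FileExistsError:
        # Same shape as fileVolume._truncate_volume() above: the low-level
        # "file exists" error is translated into a storage-level error.
        raise VolumeAlreadyExists(path)
    try:
        os.ftruncate(fd, size)
    finally:
        os.close(fd)

if __name__ == "__main__":
    vol = os.path.join(tempfile.gettempdir(), "d1fee325-demo")
    truncate_volume_exclusive(vol)          # first creation succeeds
    try:
        truncate_volume_exclusive(vol)      # second creation hits the leftover file
    except VolumeAlreadyExists as e:
        print("volume already exists:", e)
    os.unlink(vol)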

Comment 29 Yosi Ben Shimon 2020-03-31 11:33:17 UTC
(In reply to Amit Bawer from comment #27)
> (In reply to Avihai from comment #26)
> > Just to clarify after the issue is seen in at a storage domain creating a
> > new qcow floating disk fails forever on that storage domain.
> > This reproduce 100% also via webadmin.
> 
> What vdsm version is currently in use? no version appears on the vdsm.log
> itself from comment #25
> it should be later than 4.40.8

I did not see this issue in the latest versions.
Tested it today on:
ovirt-engine-4.4.0-0.26.master.el8ev.noarch
vdsm-4.40.7-1.el8ev.x86_64

and didn't find any issues.

Comment 30 Amit Bawer 2020-03-31 11:53:13 UTC
(In reply to Yosi Ben Shimon from comment #29)
> (In reply to Amit Bawer from comment #27)
> > (In reply to Avihai from comment #26)
> > > Just to clarify after the issue is seen in at a storage domain creating a
> > > new qcow floating disk fails forever on that storage domain.
> > > This reproduce 100% also via webadmin.
> > 
> > What vdsm version is currently in use? no version appears on the vdsm.log
> > itself from comment #25
> > it should be later than 4.40.8
> 
> I did not see this issue in the latest versions.
> Tested it today on:
> ovirt-engine-4.4.0-0.26.master.el8ev.noarch
> vdsm-4.40.7-1.el8ev.x86_64
> 
> and didn't find any issues.

Did you try this with an NFS domain?
The exception for the already existing volume in comment #28 is for creating a COW (thin) file volume over NFS storage.

Comment 31 Yosi Ben Shimon 2020-03-31 12:11:09 UTC
(In reply to Amit Bawer from comment #30)
> (In reply to Yosi Ben Shimon from comment #29)
> > (In reply to Amit Bawer from comment #27)
> > > (In reply to Avihai from comment #26)
> > > > Just to clarify after the issue is seen in at a storage domain creating a
> > > > new qcow floating disk fails forever on that storage domain.
> > > > This reproduce 100% also via webadmin.
> > > 
> > > What vdsm version is currently in use? no version appears on the vdsm.log
> > > itself from comment #25
> > > it should be later than 4.40.8
> > 
> > I did not see this issue in the latest versions.
> > Tested it today on:
> > ovirt-engine-4.4.0-0.26.master.el8ev.noarch
> > vdsm-4.40.7-1.el8ev.x86_64
> > 
> > and didn't find any issues.
> 
> Did you try this with nfs domain?
> The exception for the already existing volume in comment #28 is for creating
> cow (thin) file volume over nfs storage.

Yes.
Tried both nfs and iscsi domains.
No errors or exceptions in VDSM logs.

Comment 32 Amit Bawer 2020-03-31 12:46:33 UTC
(In reply to Yosi Ben Shimon from comment #31)
> (In reply to Amit Bawer from comment #30)
> > (In reply to Yosi Ben Shimon from comment #29)
> > > (In reply to Amit Bawer from comment #27)
> > > > (In reply to Avihai from comment #26)
> > > > > Just to clarify after the issue is seen in at a storage domain creating a
> > > > > new qcow floating disk fails forever on that storage domain.
> > > > > This reproduce 100% also via webadmin.
> > > > 
> > > > What vdsm version is currently in use? no version appears on the vdsm.log
> > > > itself from comment #25
> > > > it should be later than 4.40.8
> > > 
> > > I did not see this issue in the latest versions.
> > > Tested it today on:
> > > ovirt-engine-4.4.0-0.26.master.el8ev.noarch
> > > vdsm-4.40.7-1.el8ev.x86_64
> > > 
> > > and didn't find any issues.
> > 
> > Did you try this with nfs domain?
> > The exception for the already existing volume in comment #28 is for creating
> > cow (thin) file volume over nfs storage.
> 
> Yes.
> Tried both nfs and iscsi domains.
> No errors or exceptions in VDSM logs.

Thanks for re-checking; I could not hit this either on an NFS domain (vdsm-4.40.9-14.gitd83ac3bf2.el8.x86_64).
Unless someone is investigating one of the issues here, I suggest closing this BZ
and opening a more specific one if something comes up.

Comment 33 Yosi Ben Shimon 2020-04-01 07:31:28 UTC
Closing this bug.
If it occurs again, we'll investigate it and open a new one with all the relevant details.