Bug 1808850
Summary: Failed to create a new disk or copy template disk on iscsi domains due to OVF failure in ILLEGAL state

| Field | Value | Field | Value |
|---|---|---|---|
| Product | [oVirt] vdsm | Reporter | Yosi Ben Shimon <ybenshim> |
| Component | General | Assignee | Amit Bawer <abawer> |
| Status | CLOSED WORKSFORME | QA Contact | Evelina Shames <eshames> |
| Severity | urgent | Docs Contact | |
| Priority | urgent | | |
| Version | --- | CC | abawer, aefrat, bugs, bzlotnik, eshames, izuckerm, michal.skrivanek, mtessun, mzamazal, nsoffer |
| Target Milestone | ovirt-4.4.0 | Keywords | Automation, Regression, TestBlocker |
| Target Release | 4.40.7 | Flags | pm-rhel: ovirt-4.4+, pm-rhel: blocker? |
| Hardware | x86_64 | OS | Unspecified |
| Fixed In Version | vdsm-4.40.7 | Doc Type | No Doc Update |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2020-04-01 07:31:28 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | Storage | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
Description
Yosi Ben Shimon
2020-03-01 15:46:34 UTC
What is the exact flow? Does this reproduce every time? Are there debug level vdsm logs?

Yosi, without vdsm log we have no choice but closing this bug.

(In reply to Nir Soffer from comment #4)
> Yosi, without vdsm log we have no choice but closing this bug.

This issue occurred on other envs as well. Attaching the link for log-collector logs from one such env with illegal status OVFs:
https://drive.google.com/file/d/1N0XX8i9CoHOwk2fWA95ZD4Kta0Yqpuiz/view?usp=sharing

Adding a VDSM log printout of this issue seen on another env (storage-ge-01) that was rebuilt to the same build and hit the same issue with the same scenario.

VDSM logs:

This is seen 8 minutes before the issue occurs:

2020-03-01 16:36:35,204+0200 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.setMOMPolicyParameters succeeded in 0.14 seconds (__init__:312)
2020-03-01 16:36:35,253+0200 INFO (jsonrpc/7) [api.host] START hostdevListByCaps(caps=None) from=::ffff:10.35.161.181,57314, flow_id=7623ad29 (api:48)
2020-03-01 16:36:35,369+0200 ERROR (jsonrpc/7) [root] Could not read system udev path -> block path mapping: a bytes-like object is required, not 'str' (hostdev:513)

This is correlated to the issue:

2020-03-01 16:44:33,607+0200 WARN (tasks/6) [storage.ResourceManager] Resource factory failed to create resource '01_img_6f5f35ea-cb71-45c1-b96d-5606fc6d65b7.62e8fe37-cf9a-45f2-8458-c1dc5259a2ec'. Canceling request. (resourceManager:524)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 520, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/image.py", line 176, in getChain
    uuidlist = volclass.getImageVolumes(sdUUID, imgUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1476, in getImageVolumes
    return cls.manifestClass.getImageVolumes(sdUUID, imgUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line 316, in getImageVolumes
    lvs = lvm.lvsByTag(sdUUID, "%s%s" % (sc.TAG_PREFIX_IMAGE, imgUUID))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1667, in lvsByTag
    return [lv for lv in getLV(vgName) if tag in lv.tags]
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1121, in getLV
    raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
vdsm.storage.exception.LogicalVolumeDoesNotExistError: Logical volume does not exist: ('6f5f35ea-cb71-45c1-b96d-5606fc6d65b7/None',)

2020-03-01 16:44:33,608+0200 WARN (tasks/6) [storage.ResourceManager.Request] (ResName='01_img_6f5f35ea-cb71-45c1-b96d-5606fc6d65b7.62e8fe37-cf9a-45f2-8458-c1dc5259a2ec', ReqID='1bde6903-84cf-43b1-9773-b682b16a6477') Tried to cancel a processed request (resourceManager:188)
2020-03-01 16:44:33,611+0200 ERROR (tasks/6) [storage.TaskManager.Task] (Task='17a3b5fa-10db-43ae-a5c1-fbcafde94322') Unexpected error (task:874)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 448, in acquireResource
    request.cancel()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 189, in cancel
    raise RequestAlreadyProcessedError("Cannot cancel a processed "
vdsm.storage.resourceManager.RequestAlreadyProcessedError: Cannot cancel a processed request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 881, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 344, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1869, in createVolume
    with rm.acquireResource(img_ns, imgUUID, rm.EXCLUSIVE):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 829, in acquireResource
    return _manager.acquireResource(namespace, name, lockType, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 456, in acquireResource
    raise se.ResourceAcqusitionFailed()
vdsm.storage.exception.ResourceAcqusitionFailed: Could not acquire resource. Probably resource factory threw an exception.: ()

Engine:

2020-03-01 16:44:35,344+02 INFO [org.ovirt.engine.core.bll.storage.domain.ActivateStorageDomainCommand] (default task-13) [2418f982] ActivateStorage Domain. After Connect all hosts to pool. Time: Sun Mar 01 16:44:35 IST 2020
2020-03-01 16:44:35,375+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [] Failed in 'HSMGetAllTasksStatusesVDS' method
2020-03-01 16:44:35,398+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command HSMGetAllTasksStatusesVDS failed: value=Could not acquire resource. Probably resource factory threw an exception.: () abortedcode=100
2020-03-01 16:44:35,400+02 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [] SPMAsyncTask::PollTask: Polling task '17a3b5fa-10db-43ae-a5c1-fbcafde94322' (Parent Command 'AddImageFromScratch', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'cleanSuccess'.
2020-03-01 16:44:35,404+02 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [] BaseAsyncTask::logEndTaskFailure: Task '17a3b5fa-10db-43ae-a5c1-fbcafde94322' (Parent Command 'AddImageFromScratch', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = value=Could not acquire resource. Probably resource factory threw an exception.: () abortedcode=100, code = 100',
-- Exception: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = value=Could not acquire resource. Probably resource factory threw an exception.: () abortedcode=100, code = 100'

Created attachment 1667502 [details]
2nd_rep_storage-ge-01_env
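The "a bytes-like object is required, not 'str'" error in the hostdev log above is the usual Python 3 symptom of mixing bytes (as read from udev or a pipe) with str literals. A minimal sketch of the failure mode and the fix; the sample data is hypothetical, not vdsm's actual hostdev code:

```python
# The hostdev error is a Python 3 bytes/str mix-up. Sample data below is
# hypothetical; vdsm reads the real udev path -> block path mapping from the system.
raw = b"/sys/devices/pci0000:00/host2 /dev/sda\n"  # bytes, as read from a file/pipe

# Bug: splitting bytes with a str separator raises TypeError on Python 3
# ("a bytes-like object is required, not 'str'") -- Python 2 allowed it.
try:
    raw.strip().split(" ")
except TypeError as e:
    assert "bytes-like object is required" in str(e)

# Fix: decode to str first (or use a bytes separator, raw.split(b" ")).
sys_path, block_path = raw.decode("utf-8").strip().split(" ")
assert block_path == "/dev/sda"
```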
(In reply to Avihai from comment #6)
> This is seen 8 minutes before the issue occurs:
> 2020-03-01 16:36:35,204+0200 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC
> call Host.setMOMPolicyParameters succeeded in 0.14 seconds (__init__:312)
> 2020-03-01 16:36:35,253+0200 INFO (jsonrpc/7) [api.host] START
> hostdevListByCaps(caps=None) from=::ffff:10.35.161.181,57314,
> flow_id=7623ad29 (api:48)
> 2020-03-01 16:36:35,369+0200 ERROR (jsonrpc/7) [root] Could not read system
> udev path -> block path mapping: a bytes-like object is required, not 'str'
> (hostdev:513)

This is a virt bug. Milan, can you check this?

(In reply to Avihai from comment #6)
> Adding VDSM logs printout from this issue seen on another ENV(storage-ge-01)
> that was rebuilt to the same build and saw the same issue with the same scenario.
...
> File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1667, in lvsByTag
>   return [lv for lv in getLV(vgName) if tag in lv.tags]
> File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1121, in getLV
>   raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
> vdsm.storage.exception.LogicalVolumeDoesNotExistError: Logical volume does
> not exist: ('6f5f35ea-cb71-45c1-b96d-5606fc6d65b7/None',)

This looks like a bug in lvm.getLV. When getLV() is called without specifying lv names, it should return a list of all lvs, but here the code treated the default value (None) as an lv name.

It may be related to recent changes in the lvm module.

Amit, can you check this?

(In reply to Nir Soffer from comment #9)
> This looks like a bug in lvm.getLV. When getLV() is called without specifying
> lv names, it should return a list of all lvs, but here the code treated the
> default value (None) as an lv name.
...
> Amit, can you check this?
I think there is an issue with devices, adding my reply from the mail thread:

These are the relevant errors:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 520, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/image.py", line 176, in getChain
    uuidlist = volclass.getImageVolumes(sdUUID, imgUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1476, in getImageVolumes
    return cls.manifestClass.getImageVolumes(sdUUID, imgUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line 316, in getImageVolumes
    lvs = lvm.lvsByTag(sdUUID, "%s%s" % (sc.TAG_PREFIX_IMAGE, imgUUID))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1667, in lvsByTag
    return [lv for lv in getLV(vgName) if tag in lv.tags]
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1121, in getLV
    raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
vdsm.storage.exception.LogicalVolumeDoesNotExistError: Logical volume does not exist: ('6f5f35ea-cb71-45c1-b96d-5606fc6d65b7/None',)

Based on reading the code, this means `lvs` didn't return anything vdsm could parse?

If I run lvs/vgs manually there are warnings:

$ lvs
  WARNING: Not using device /dev/sdf for PV XQ2Ea4-5YJV-xLgV-b69w-k1o3-6DiK-JQ7Wua.
  WARNING: Not using device /dev/mapper/360002ac0000000000000001500021f6b for PV ZqW5kw-Uctu-LnIH-f7jT-1CeU-gXgG-e4dsRg.
  WARNING: PV XQ2Ea4-5YJV-xLgV-b69w-k1o3-6DiK-JQ7Wua prefers device /dev/mapper/360002ac0000000000000001000021f6b because device name matches previous.
  WARNING: PV ZqW5kw-Uctu-LnIH-f7jT-1CeU-gXgG-e4dsRg prefers device /dev/sde because device is used by LV.

360002ac0000000000000001000021f6b is the one used by the VG 6f5f35ea-cb71-45c1-b96d-5606fc6d65b7.

If I manually try to create an LV:

$ lvcreate -L 100M -n foo 6f5f35ea-cb71-45c1-b96d-5606fc6d65b7
  Cannot update volume group 6f5f35ea-cb71-45c1-b96d-5606fc6d65b7 with duplicate PV devices.

Refreshing the iscsi connection fixes the issue (same as reactivating the host), which would mean the host might be seeing stale devices.

(In reply to Benny Zlotnik from comment #10)
> Based on reading the code, this means `lvs` didn't return anything vdsm
> could parse?

Possible. Looks like the error message is bad:

1117 def getLV(vgName, lvName=None):
1118     lv = _lvminfo.getLv(vgName, lvName)
1119     # getLV() should not return None
1120     if not lv:
1121         raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))
1122     else:
1123         return lv

The error is relevant only when lvName != None. The real issue is hiding the error in LVMCache.getLv() and returning a bad value, converted to an unhelpful exception in lvm.getLV().

To understand this error we need to read the vdsm logs before this call.

> Refreshing the iscsi connection fixes the issue (same as reactivating the
> host), which would mean the host might be seeing stale devices

Is this issue reproducible?

(In reply to Nir Soffer from comment #9)
> This looks like a bug in lvm.getLV. When getLV() is called without specifying
> lv names, it should return a list of all lvs, but here the code treated the
> default value (None) as an lv name.
...
> Amit, can you check this?

Backtrace showing that lvsByTag() asked for all lvs on vg=6f5f35ea-cb71-45c1-b96d-5606fc6d65b7, with lvName=None sent to lvm.getLV(...).
lvm.getLV(6f5f35ea-cb71-45c1-b96d-5606fc6d65b7, None) invokes LVMCache.getLv(6f5f35ea-cb71-45c1-b96d-5606fc6d65b7, None), which should return a list of non-stale lvs belonging to this vg. The lv list comes back empty, so lv == [] at lvm.getLV:1120, already cited in comment #11, and the exception is raised for the non-existing vgName/lvName parameters (where lvName was given as None in this flow).

It is hard to say from the given log what caused LVMCache to return an empty list of lvs for that vg; if any lvs command used to update the cache had failed, we should have seen a WARN message about it in the log (there are WARN messages about reloading VGs, but not for LVs). Maybe the pv device errors/mixups in this env caused the 'lvs' command to return with okay status for this VG but with no actual LVs, so it is expected to get an empty list from LVMCache.getLv(...) in this case, since the cache itself is stale and we have no non-stale lvs to update from.

(In reply to Nir Soffer from comment #11)
...
> Is this issue reproducible?

Not manually.

(In reply to Nir Soffer from comment #8)
...
> This is a virt bug. Milan, can you check this?

Yes, I'll fix it.

What is of interest here is the device mapping during the failure. If re-login to the iSCSI target gets rid of it, we seem to have an issue either with multipathd not recognizing all the paths, or with access to the device (e.g. by bypassing the multipath device), or some stale lvm caches. So gathering an sosreport while the error is there might help in understanding how the devices look at that stage. Given the above error it looks to me as if lvm is using one path only in a multipath environment, thus getting the "duplicate PV" error message.

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Seen again with LVM filters configured on all hosts, with the same scenario. Logs attached.

ovirt-engine-4.4.0-0.26.master.el8ev.noarch
vdsm-4.40.7-1.el8ev.x86_64

LVM filters are configured on all hosts:

[root@lynx24 vdsm]# vdsm-tool config-lvm-filter
Analyzing host...
LVM filter is already configured for Vdsm

Events:

Mar 21, 2020, 9:42:54 PM User admin@internal-authz is copying disk latest-rhel-guest-image-7.7-infra to domain iscsi_0.
Mar 21, 2020, 9:43:00 PM VDSM host_mixed_1 command HSMGetAllTasksStatusesVDS failed: value=Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\' Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']' abortedcode=550

Copy to iSCSI SD:

Engine:

2020-03-21 21:43:00,323+02 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] Polling and updating Async Tasks: 2 tasks, 1 tasks to poll now
2020-03-21 21:43:00,328+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] Failed in 'HSMGetAllTasksStatusesVDS' method
2020-03-21 21:43:00,332+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command HSMGetAllTasksStatusesVDS failed: value=Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\' Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']' abortedcode=550
2020-03-21 21:43:00,333+02 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] SPMAsyncTask::PollTask: Polling task 'c7d3ec52-156c-4077-b9da-bfbedc1278b1' (Parent Command 'CreateVolumeContainer', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'cleanSuccess'.
2020-03-21 21:43:00,335+02 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] BaseAsyncTask::logEndTaskFailure: Task 'c7d3ec52-156c-4077-b9da-bfbedc1278b1' (Parent Command 'CreateVolumeContainer', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = value=Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\' Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']' abortedcode=550, code = 550',
-- Exception: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = value=Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\' Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']' abortedcode=550, code = 550'
2020-03-21 21:43:00,336+02 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] CommandAsyncTask::endActionIfNecessary: All tasks of command 'e2f495fc-9c21-47fa-b035-24fa39f689ea' has ended -> executing 'endAction'
2020-03-21 21:43:00,336+02 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] CommandAsyncTask::endAction: Ending action for '1' tasks (command ID: 'e2f495fc-9c21-47fa-b035-24fa39f689ea'): calling endAction '.
2020-03-21 21:43:00,337+02 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-13224) [] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'CreateVolumeContainer',
2020-03-21 21:43:00,341+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CreateVolumeContainerCommand] (EE-ManagedThreadFactory-engine-Thread-13224) [a4d7bedb-0782-4324-be0e-a3b13e4b37bc] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CreateVolumeContainerCommand' with failure.

VDSM log:

2020-03-21 21:42:57,373+0200 ERROR (tasks/3) [storage.Volume] Failed to create volume /rhev/data-center/mnt/blockSD/2eab638c-d8aa-4a2d-9710-aaaa9b64c410/images/a5d54a36-40da-46c2-ac35-71c714e0a633/d1fee325-96ff-40fb-93ec-49245f3c1791: Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\' Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']' (volume:1166)
2020-03-21 21:42:57,373+0200 ERROR (tasks/3) [storage.Volume] Unexpected error (volume:1202)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1163, in create
    initial_size=initial_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line 503, in _create
    initialTags=(sc.TAG_VOL_UNINIT,))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1406, in createLV
    raise se.CannotCreateLogicalVolume(vgName, lvName, err)
vdsm.storage.exception.CannotCreateLogicalVolume: Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\' Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']'

2020-03-21 21:42:57,374+0200 ERROR (tasks/3) [storage.TaskManager.Task] (Task='c7d3ec52-156c-4077-b9da-bfbedc1278b1') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1874, in createVolume
    initial_size=initialSize)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 971, in createVolume
    initial_size=initial_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1163, in create
    initial_size=initial_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockVolume.py", line 503, in _create
    initialTags=(sc.TAG_VOL_UNINIT,))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1406, in createLV
    raise se.CannotCreateLogicalVolume(vgName, lvName, err)
vdsm.storage.exception.CannotCreateLogicalVolume: Cannot create Logical Volume: 'vgname=2eab638c-d8aa-4a2d-9710-aaaa9b64c410 lvname=d1fee325-96ff-40fb-93ec-49245f3c1791 err=[\' Logical Volume "d1fee325-96ff-40fb-93ec-49245f3c1791" already exists in volume group "2eab638c-d8aa-4a2d-9710-aaaa9b64c410"\']'

Created attachment 1672570 [details]
rhv-4.4.0-26 reproduction engine/vdsm logs
Just to clarify: after the issue is seen on a storage domain, creating a new qcow floating disk fails forever on that storage domain. This reproduces 100%, also via webadmin.

(In reply to Avihai from comment #26)
> Just to clarify: after the issue is seen on a storage domain, creating a
> new qcow floating disk fails forever on that storage domain.
> This reproduces 100%, also via webadmin.

What vdsm version is currently in use? No version appears in the vdsm.log itself from comment #25; it should be later than 4.40.8.

OTOH, I see an exception for an already existing file volume during creation in the log of comment #25, which will probably not relate to the fixes in 4.40.8 and later:

2020-03-21 21:42:06,696+0200 INFO (tasks/4) [storage.Volume] Creating volume d1fee325-96ff-40fb-93ec-49245f3c1791 (volume:1142)
2020-03-21 21:42:06,735+0200 ERROR (tasks/4) [storage.Volume] Failed to create volume /rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge__6__nfs__3/16483589-b481-49ed-8f06-596df6330afb/images/a5d54a36-40da-46c2-ac35-71c714e0a633/d1fee325-96ff-40fb-93ec-49245f3c1791: Volume already exists: ('d1fee325-96ff-40fb-93ec-49245f3c1791',) (volume:1166)
2020-03-21 21:42:06,735+0200 ERROR (tasks/4) [storage.Volume] Unexpected error (volume:1202)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 542, in _truncate_volume
    creatExcl=True)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 347, in truncateFile
    ioproc.truncate(path, size, mode if mode is not None else 0, creatExcl)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 576, in truncate
    self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 448, in _sendCommand
    raise OSError(errcode, errstr)
FileExistsError: [Errno 17] File exists

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1163, in create
    initial_size=initial_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 457, in _create
    imgUUID, srcImgUUID, srcVolUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 513, in _create_cow_volume
    cls._truncate_volume(vol_path, 0, vol_id, dom)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 545, in _truncate_volume
    raise se.VolumeAlreadyExists(vol_id)
vdsm.storage.exception.VolumeAlreadyExists: Volume already exists: ('d1fee325-96ff-40fb-93ec-49245f3c1791',)
2020-03-21 21:42:06,735+0200 ERROR (tasks/4) [storage.TaskManager.Task] (Task='0840b2d9-9f0c-4e92-ab10-a89befbcc899') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 542, in _truncate_volume
    creatExcl=True)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 347, in truncateFile
    ioproc.truncate(path, size, mode if mode is not None else 0, creatExcl)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 576, in truncate
    self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 448, in _sendCommand
    raise OSError(errcode, errstr)
FileExistsError: [Errno 17] File exists

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1874, in createVolume
    initial_size=initialSize)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 971, in createVolume
    initial_size=initial_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1163, in create
    initial_size=initial_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 457, in _create
    imgUUID, srcImgUUID, srcVolUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 513, in _create_cow_volume
    cls._truncate_volume(vol_path, 0, vol_id, dom)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileVolume.py", line 545, in _truncate_volume
    raise se.VolumeAlreadyExists(vol_id)
vdsm.storage.exception.VolumeAlreadyExists: Volume already exists: ('d1fee325-96ff-40fb-93ec-49245f3c1791',)

(In reply to Amit Bawer from comment #27)
> (In reply to Avihai from comment #26)
> > Just to clarify: after the issue is seen on a storage domain, creating a
> > new qcow floating disk fails forever on that storage domain.
> > This reproduces 100%, also via webadmin.
>
> What vdsm version is currently in use? No version appears in the vdsm.log
> itself from comment #25; it should be later than 4.40.8.

I did not see this issue in the latest versions.
Tested it today on:
ovirt-engine-4.4.0-0.26.master.el8ev.noarch
vdsm-4.40.7-1.el8ev.x86_64

and didn't find any issues.

(In reply to Yosi Ben Shimon from comment #29)
> (In reply to Amit Bawer from comment #27)
> > What vdsm version is currently in use? No version appears in the vdsm.log
> > itself from comment #25; it should be later than 4.40.8.
>
> I did not see this issue in the latest versions.
> Tested it today on:
> ovirt-engine-4.4.0-0.26.master.el8ev.noarch
> vdsm-4.40.7-1.el8ev.x86_64
>
> and didn't find any issues.

Did you try this with nfs domain?
The exception for the already existing volume in comment #28 is for creating a cow (thin) file volume over nfs storage.

(In reply to Amit Bawer from comment #30)
> Did you try this with nfs domain?
> The exception for the already existing volume in comment #28 is for creating
> a cow (thin) file volume over nfs storage.

Yes.
Tried both nfs and iscsi domains.
No errors or exceptions in VDSM logs.

(In reply to Yosi Ben Shimon from comment #31)
> Yes.
> Tried both nfs and iscsi domains.
> No errors or exceptions in VDSM logs.

Thanks for re-checking. I could not hit this either on an nfs domain (vdsm-4.40.9-14.gitd83ac3bf2.el8.x86_64). Unless someone is investigating one of the issues here, I suggest closing this bz and opening a more specific one if something comes up.

Closing this bug. If it occurs again, we'll investigate it and open a new one with all the relevant details.
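For the record, the FileExistsError in the NFS tracebacks (comment #27) is the standard exclusive-create failure: the volume file is created with an exclusive flag, and errno 17 is translated into VolumeAlreadyExists. A minimal sketch of the same pattern using plain os.open (hypothetical helper name; a simplification of ioprocess truncateFile with creatExcl=True, not its actual API):

```python
import os
import tempfile

def truncate_exclusive(path, size=0, mode=0o660):
    # O_CREAT | O_EXCL fails with FileExistsError (errno 17) when the
    # path already exists -- the same error reported in the traceback
    # before fileVolume._truncate_volume re-raised it as VolumeAlreadyExists.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, mode)
    try:
        os.ftruncate(fd, size)
    finally:
        os.close(fd)

# First attempt creates the (empty) volume file; a second attempt on the
# same path fails immediately, matching the "Volume already exists" log.
vol_path = os.path.join(tempfile.mkdtemp(), "d1fee325-example")
truncate_exclusive(vol_path)
```

The exclusive create guarantees only one task can win the creation race; the downside, as seen in this bug, is that a leftover file from an aborted task blocks all retries until it is removed.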