Description of problem:

Version-Release number of selected component (if applicable):

* Datacenter with both iSCSI-based and NFS-based SDs.
* vdsm LVM monitor actions are performed against NFS SDs.
* vdsm version vdsm-4.40.90.5.
* Warning messages are registered; UUIDs 028b34b7-70de-43a7-81a2-52bf04f22253 and d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf are NFS-based storage domains that vdsm tries to read with LVM commands:

2022-01-31 14:14:51,417+0100 WARN (monitor/028b34b) [storage.LVM] All 1 tries have failed: LVM command failed: 'cmd=['/sbin/lvm', 'vgs', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/0QEMU_QEMU_HARDDISK_01204632-58d4-11ec-baaa-98eecbbd1afb$|^/dev/mapper/0QEMU_QEMU_HARDDISK_44134d7c-79cc-11ec-a424-3ce1a1c47e3c$|^/dev/mapper/0QEMU_QEMU_HARDDISK_799ab0bc-52a7-11ec-8397-525400fff335$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }', '--noheadings', '--units', 'b', '--nosuffix', '--separator', '|', '--ignoreskippedcluster', '-o', 'uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name', '028b34b7-70de-43a7-81a2-52bf04f22253'] rc=5 out=[] err=[' Volume group "028b34b7-70de-43a7-81a2-52bf04f22253" not found', ' Cannot process volume group 028b34b7-70de-43a7-81a2-52bf04f22253']' (lvm:561)

2022-01-31 14:08:47,318+0100 WARN (tasks/4) [storage.LVM] All 1 tries have failed: LVM command failed: 'cmd=['/sbin/lvm', 'vgs', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/0QEMU_QEMU_HARDDISK_01204632-58d4-11ec-baaa-98eecbbd1afb$|^/dev/mapper/0QEMU_QEMU_HARDDISK_44134d7c-79cc-11ec-a424-3ce1a1c47e3c$|^/dev/mapper/0QEMU_QEMU_HARDDISK_799ab0bc-52a7-11ec-8397-525400fff335$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }', '--noheadings', '--units', 'b', '--nosuffix', '--separator', '|', '--ignoreskippedcluster', '-o', 'uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name', 'd2a5a1ca-34f8-4b4b-9eb8-946e117e45bf'] rc=5 out=[] err=[' Volume group "d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf" not found', ' Cannot process volume group d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf']' (lvm:561)

2022-01-31 14:08:47,354+0100 INFO (tasks/4) [storage.StorageDomainCache] Looking up domain d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf: 0.15 seconds (utils:390)

2022-01-31 14:08:47,354+0100 ERROR (tasks/4) [storage.StoragePool] Backup domain validation failed (sp:360)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 357, in startSpm
    self.checkBackupDomain(__securityOverride=True)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1606, in checkBackupDomain
    dom = sdCache.produce(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 115, in produce
    domain.getRealDomain()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 139, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 156, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 186, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
vdsm.storage.exception.StorageDomainDoesNotExist: Storage domain does not exist: ('d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf',)

How reproducible:
Put some iSCSI or FC SD into maintenance mode and enable it again.

Steps to Reproduce:

Actual results:
SD monitor actions against the NFS SDs fail but are only registered as warnings; in other cases they trigger further LVM actions on the SD.

Expected results:
LVM commands should only access SDs that are iSCSI or FC based.

Additional info:
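For illustration only, a minimal standalone sketch (an assumed helper script, not vdsm code) of the kind of vgs probe that the monitor runs against an SD UUID. Since an NFS-backed SD UUID is not an LVM volume group, the command exits non-zero (rc=5 in the logs above) and the domain looks "missing" to LVM:

    # Minimal sketch, not vdsm code: probe whether a storage-domain UUID is
    # visible as an LVM volume group, similar to the "vgs" check shown in
    # the warnings above. For an NFS-backed SD UUID the command fails with
    # "Volume group ... not found".
    import subprocess

    def vg_exists(sd_uuid):
        # "lvm vgs <name>" exits non-zero (rc=5 above) when the VG does not exist.
        result = subprocess.run(
            ["/sbin/lvm", "vgs", "--noheadings", "-o", "vg_name", sd_uuid],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        return result.returncode == 0

    # The NFS-based SD UUID from this report is not a VG, so this prints False.
    print(vg_exists("028b34b7-70de-43a7-81a2-52bf04f22253"))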
We are past 4.5.0 feature freeze, please re-target.
What a surprise!! The issue doesn't arise with vdsm-4.40.100.2-1.el8ev.x86_64: it only shows the warning message, which by design is correct behaviour, without the stack trace, and the rhev links are rebuilt. I'm going to close the bugzilla as fixed.
This is not really fixed; this is the expected behavior with the current system.

When looking up a storage domain, vdsm uses the domain find method:

    150     def _findDomain(self, sdUUID):
    151         try:
    152             findMethod = self.knownSDs[sdUUID]
    153         except KeyError:
    154             findMethod = self._findUnfetchedDomain
    155
    156         return findMethod(sdUUID)

But if the domain is unknown, vdsm falls back to searching for the storage domain on all connected servers:

    158     def _findUnfetchedDomain(self, sdUUID):
    159         from vdsm.storage import blockSD
    160         from vdsm.storage import glusterSD
    161         from vdsm.storage import localFsSD
    162         from vdsm.storage import nfsSD
    163
    164         # The order is somewhat important, it's ordered
    165         # by how quickly get can find the domain. For instance
    166         # if an nfs mount is unavailable we will get stuck
    167         # until it times out, this should affect fetching
    168         # of block\local domains. If for any case in the future
    169         # this changes, please update the order.
    170
    171         self.log.info("Looking up domain %s", sdUUID)
    172         with utils.stopwatch(
    173                 "Looking up domain {}".format(sdUUID),
    174                 level=logging.INFO,
    175                 log=self.log):
    176             for mod in (blockSD, glusterSD, localFsSD, nfsSD):
    177                 try:
    178                     return mod.findDomain(sdUUID)
    179                 except se.StorageDomainDoesNotExist:
    180                     pass
    181                 except Exception:
    182                     self.log.error(
    183                         "Error while looking for domain `%s`",
    184                         sdUUID, exc_info=True)

In this case, when looking up the storage domain in blockSD, we run vgs and fail to find the domain. This logs a bogus error instead of a debug log; the current code has no context to tell that this is a "check if the domain exists" lookup rather than a "get an existing domain" lookup.

I think we have 2 issues:

1. Why is the storage domain unknown? Domain find methods are added to the knownSDs dict when connecting to the storage server, and they should not be deleted while the domain is in use. I guess there is a bug in this mechanism.

2. When looking up an unknown domain, not finding it is not an error and should not be logged as one.

Opening the bug since this is a real bug. Removing from 4.5.1 since this is a very old issue with low severity. We may fix this in 4.4.z.
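On point 2, a hypothetical sketch of the idea (names such as produce_domain and the expected flag are illustrative only, not vdsm's actual API): give the lookup path enough context to log a miss at debug level when the caller is merely probing, and at error level only when the caller expects the domain to exist:

    # Hypothetical sketch only - illustrates logging a lookup miss at the
    # right level depending on caller intent; not the actual vdsm change.
    import logging

    log = logging.getLogger("storage.StorageDomainCache")

    class StorageDomainDoesNotExist(Exception):
        pass

    def produce_domain(sd_uuid, find_method, expected=True):
        try:
            return find_method(sd_uuid)
        except StorageDomainDoesNotExist:
            if expected:
                # The caller expects the domain to exist - a real error.
                log.error("Storage domain %s does not exist", sd_uuid)
            else:
                # The caller is only probing an unknown domain - a miss is normal.
                log.debug("Storage domain %s not found while probing", sd_uuid)
            raise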
It's a low-severity issue as it is "recovered" automatically when connecting the storage server. Yet, it creates a misleading error that also affected us (DEV), so we'd like to address it.
More info on the logs - we have 2 logs:

1. Warning when running "vgs":

2022-01-31 14:14:51,417+0100 WARN (monitor/028b34b) [storage.LVM] All 1 tries have failed: ...

This log is left over from the time when we retried the lvm command before failing. Now we run once with specific lvm devices, and again with all devices, so this log is not very helpful and can be removed.

2. Traceback in startSpm:

2022-01-31 14:08:47,354+0100 ERROR (tasks/4) [storage.StoragePool] Backup domain validation failed (sp:360)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 357, in startSpm
    self.checkBackupDomain(__securityOverride=True)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1606, in checkBackupDomain
    dom = sdCache.produce(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 115, in produce
    domain.getRealDomain()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 139, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 156, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 186, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
vdsm.storage.exception.StorageDomainDoesNotExist: Storage domain does not exist: ('d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf',)

This is a real error - it means that we looked for the storage domain in all storage domain types and could not find it.

The first issue is very common and easy to fix. The second is much rarer (startSpm under some condition) and likely harder to fix. Let's handle the first issue in this bug and file another bug for the second.
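To illustrate the first point, a minimal sketch (not the actual vdsm patch) of a retry helper that emits the "All N tries have failed" warning only when more than one attempt was actually made, so a single failed LVM attempt is not logged as a warning:

    # Illustrative sketch only: warn about exhausted retries only when there
    # was more than one attempt; a single failed attempt stays at debug level
    # and the exception propagates to the caller.
    import logging

    log = logging.getLogger("storage.LVM")

    def run_with_retries(func, tries=1):
        for attempt in range(1, tries + 1):
            try:
                return func()
            except Exception as e:
                last_error = e
                log.debug("Attempt %d/%d failed: %s", attempt, tries, e)
        if tries > 1:
            log.warning("All %d tries have failed: %s", tries, last_error)
        raise last_error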
As stated above, this bug handles the issue of the unrelated warning that pops up when looking for storage domains (i.e., "2022-01-31 14:14:51,417+0100 WARN (monitor/028b34b) [storage.LVM] All 1 tries have failed: ..."); the second issue is tracked by upstream issue https://github.com/oVirt/vdsm/issues/158.
QE doesn't have the capacity to verify this bug during 4.5.1. The team focuses on CBT tasks.
Hi Albert, can you please provide the steps to reproduce this error here?
Hi Sophie, note that the original issue is not addressed in the linked PR, as it has been split into an upstream issue. We don't know how to reproduce that one yet.
Verified. Could reproduce it on vdsm-4.50.0.13 with an automation TC that triggered the WARN message "All 1 tries have failed: LVM command failed". On the fixed version we no longer see those warnings.

Versions:
engine-4.5.1.2-0.11.el8ev
vdsm-4.50.1.3-1.el8ev
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.5.1] update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:5583
Due to QE capacity constraints, we are not going to cover this issue in our automation.