Bug 2048545 - Bogus errors and warnings when looking up NFS storage domain using LVM
Summary: Bogus errors and warnings when looking up NFS storage domain using LVM
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.4.9
Hardware: All
OS: All
Priority: low
Severity: low
Target Milestone: ovirt-4.5.1
Assignee: Albert Esteve
QA Contact: sshmulev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-31 13:40 UTC by José Enrique
Modified: 2022-07-20 08:52 UTC
CC List: 9 users

Fixed In Version: 4.50.1.3
Doc Type: Bug Fix
Doc Text:
LVM command error messages were improved to provide clearer information about how to trace and debug errors.
Clone Of:
Environment:
Last Closed: 2022-07-14 12:41:13 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
GitHub oVirt vdsm pull 159 (Merged): lvm: remove outdated warning - 2022-06-15 06:37:48 UTC
Red Hat Issue Tracker RHV-44584 - 2022-01-31 13:48:01 UTC
Red Hat Product Errata RHBA-2022:5583 - 2022-07-14 12:41:29 UTC

Description José Enrique 2022-01-31 13:40:14 UTC
Description of problem:


Version-Release number of selected component (if applicable):

* Data center with both iSCSI- and NFS-based SDs.
* vdsm LVM monitor actions are performed against NFS SDs.
* vdsm version: vdsm-4.40.90.5
* The following warning messages are registered:

UUIDs 028b34b7-70de-43a7-81a2-52bf04f22253 and d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf belong to NFS-based storage domains, yet vdsm tries to read them with lvm commands:

2022-01-31 14:14:51,417+0100 WARN  (monitor/028b34b) [storage.LVM] All 1 tries have failed: LVM command failed: 'cmd=[\'/sbin/lvm\', \'vgs\', \'--config\', \'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/0QEMU_QEMU_HARDDISK_01204632-58d4-11ec-baaa-98eecbbd1afb$|^/dev/mapper/0QEMU_QEMU_HARDDISK_44134d7c-79cc-11ec-a424-3ce1a1c47e3c$|^/dev/mapper/0QEMU_QEMU_HARDDISK_799ab0bc-52a7-11ec-8397-525400fff335$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }\', \'--noheadings\', \'--units\', \'b\', \'--nosuffix\', \'--separator\', \'|\', \'--ignoreskippedcluster\', \'-o\', \'uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name\', \'028b34b7-70de-43a7-81a2-52bf04f22253\'] rc=5 out=[] err=[\'  Volume group "028b34b7-70de-43a7-81a2-52bf04f22253" not found\', \'  Cannot process volume group 028b34b7-70de-43a7-81a2-52bf04f22253\']' (lvm:561)

2022-01-31 14:08:47,318+0100 WARN  (tasks/4) [storage.LVM] All 1 tries have failed: LVM command failed: 'cmd=[\'/sbin/lvm\', \'vgs\', \'--config\', \'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/0QEMU_QEMU_HARDDISK_01204632-58d4-11ec-baaa-98eecbbd1afb$|^/dev/mapper/0QEMU_QEMU_HARDDISK_44134d7c-79cc-11ec-a424-3ce1a1c47e3c$|^/dev/mapper/0QEMU_QEMU_HARDDISK_799ab0bc-52a7-11ec-8397-525400fff335$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }\', \'--noheadings\', \'--units\', \'b\', \'--nosuffix\', \'--separator\', \'|\', \'--ignoreskippedcluster\', \'-o\', \'uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name\', \'d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf\'] rc=5 out=[] err=[\'  Volume group "d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf" not found\', \'  Cannot process volume group d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf\']' (lvm:561)
2022-01-31 14:08:47,354+0100 INFO  (tasks/4) [storage.StorageDomainCache] Looking up domain d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf: 0.15 seconds (utils:390)
2022-01-31 14:08:47,354+0100 ERROR (tasks/4) [storage.StoragePool] Backup domain validation failed (sp:360)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 357, in startSpm
    self.checkBackupDomain(__securityOverride=True)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1606, in checkBackupDomain
    dom = sdCache.produce(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 115, in produce
    domain.getRealDomain()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 139, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 156, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 186, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
vdsm.storage.exception.StorageDomainDoesNotExist: Storage domain does not exist: ('d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf',)



How reproducible:

Put an iSCSI or FC SD into maintenance mode and then activate it again.

Steps to Reproduce:

Actual results:

SD monitor actions fail but are only registered as warnings; in other cases they trigger some LVM SD actions.


Expected results:

Only SDs that are iSCSI or FC based should be accessed with lvm commands.
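
For illustration, a minimal sketch of the kind of filtering this expected result describes: skip LVM lookups entirely for file-based SDs. This is not vdsm code; BLOCK_TYPES, sd_types and the helper names are assumptions made up for the example.

    import subprocess

    # Hypothetical storage-type table; in vdsm this information would come
    # from the connect-storage-server flow, not from a plain dict.
    BLOCK_TYPES = {"ISCSI", "FCP"}

    def run_vgs(sd_uuid):
        # "lvm vgs <name>" exits non-zero when the VG does not exist.
        return subprocess.run(
            ["/sbin/lvm", "vgs", "--noheadings", sd_uuid],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    def lookup_with_lvm(sd_uuid, sd_types):
        # File-based SD (NFS, Gluster, local): nothing to do with LVM.
        if sd_types.get(sd_uuid) not in BLOCK_TYPES:
            return None
        return run_vgs(sd_uuid)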

Additional info:

Comment 4 Sandro Bonazzola 2022-03-29 16:16:40 UTC
We are past 4.5.0 feature freeze, please re-target.

Comment 5 José Enrique 2022-04-21 15:16:01 UTC
What a surprise! The issue doesn't arise with vdsm-4.40.100.2-1.el8ev.x86_64: only the warning message is shown (which, if by design, is correct behaviour), without the stack trace or the rebuilding of the rhev links.

I'm going to close the bugzilla as fixed.

Comment 6 Nir Soffer 2022-04-21 19:58:01 UTC
This is not really fixed; this is the expected behavior of the current system.

When looking up a storage domain, vdsm uses the domain find method:

150     def _findDomain(self, sdUUID):
151         try:
152             findMethod = self.knownSDs[sdUUID]
153         except KeyError:
154             findMethod = self._findUnfetchedDomain
155 
156         return findMethod(sdUUID)

But if the domain is unknown, vdsm falls back to searching for the storage
domain on all connected storage servers:

158     def _findUnfetchedDomain(self, sdUUID):
159         from vdsm.storage import blockSD
160         from vdsm.storage import glusterSD
161         from vdsm.storage import localFsSD
162         from vdsm.storage import nfsSD
163 
164         # The order is somewhat important, it's ordered
165         # by how quickly get can find the domain. For instance
166         # if an nfs mount is unavailable we will get stuck
167         # until it times out, this should affect fetching
168         # of block\local domains. If for any case in the future
169         # this changes, please update the order.
170 
171         self.log.info("Looking up domain %s", sdUUID)
172         with utils.stopwatch(
173                 "Looking up domain {}".format(sdUUID),
174                 level=logging.INFO,
175                 log=self.log):
176             for mod in (blockSD, glusterSD, localFsSD, nfsSD):
177                 try:
178                     return mod.findDomain(sdUUID)
179                 except se.StorageDomainDoesNotExist:
180                     pass
181                 except Exception:
182                     self.log.error(
183                         "Error while looking for domain `%s`",
184                         sdUUID, exc_info=True)

In this case, when looking up the storage domain in blockSD, we run vgs
and fail to find the domain. This logs a bogus error instead of a debug
message; the current code has no context to tell that this is a
"check if the domain exists" rather than a "get an existing domain".

I think we have 2 issues:

1. Why is the storage domain unknown? Domain find methods are added
   to the knownSDs dict when connecting to a storage server. They should
   not be deleted while the domain is in use. I guess there is a bug in
   this mechanism.

2. When looking up an unknown domain, not finding the domain is not an
   error and should not be logged as an error (a sketch of this follows
   below).
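
A minimal sketch of the direction suggested in issue 2: log the "not found" case at debug level when we are only probing for a domain. This is illustrative only, not the vdsm implementation; the exception class and the get_vg callable are stand-ins.

    import logging

    log = logging.getLogger("storage.example")

    class VolumeGroupDoesNotExist(Exception):
        """Stand-in for the real 'VG not found' error."""

    def probe_block_domain(sd_uuid, get_vg):
        try:
            return get_vg(sd_uuid)
        except VolumeGroupDoesNotExist:
            # Expected while probing: the domain may simply be file based.
            log.debug("No VG named %s, trying the next SD type", sd_uuid)
            return None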

Reopening the bug since this is a real issue.

Removing from 4.5.1 since this is a very old issue with low severity. We may
fix it in 4.4.z.

Comment 7 Arik 2022-04-25 14:54:54 UTC
It's a low-severity issue as it is "recovered" automatically when connecting the storage server.
Yet it creates a misleading error that also affected us (DEV), so we'd like to address it.

Comment 8 Nir Soffer 2022-04-26 16:50:46 UTC
More info on the logs - we have 2 logs:

1. Warning when running "vgs"

    2022-01-31 14:14:51,417+0100 WARN  (monitor/028b34b) [storage.LVM] All 1 tries have failed: ...

This log is a leftover from the time when we retried the lvm command
before failing. Now we run once with specific lvm devices, and again
with all devices, so this log is not very helpful and can be removed.
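
As a rough sketch of removing the leftover (illustrative only; run_lvm_command and max_tries are hypothetical names, not the actual code changed by the linked PR): with a single try there is nothing to warn about, so intermediate failures are logged at debug level and the last error is simply raised to the caller.

    import logging

    log = logging.getLogger("storage.example")

    def run_with_retries(run_lvm_command, cmd, max_tries=1):
        last_error = None
        for attempt in range(1, max_tries + 1):
            try:
                return run_lvm_command(cmd)
            except Exception as e:
                last_error = e
                # No "All N tries have failed" warning here; the caller
                # decides how to report the final failure.
                log.debug("Attempt %d/%d failed: %s", attempt, max_tries, e)
        raise last_error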

2. Traceback in startSpm

2022-01-31 14:08:47,354+0100 ERROR (tasks/4) [storage.StoragePool] Backup domain validation failed (sp:360)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 357, in startSpm
    self.checkBackupDomain(__securityOverride=True)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1606, in checkBackupDomain
    dom = sdCache.produce(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 115, in produce
    domain.getRealDomain()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 139, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 156, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 186, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
vdsm.storage.exception.StorageDomainDoesNotExist: Storage domain does not exist: ('d2a5a1ca-34f8-4b4b-9eb8-946e117e45bf',)

This is a real error - it means that we looked for the storage domain in
all storage domain types and we could not find it.

The first issue is very common and easy to fix. The second is much rarer
(startSpm under certain conditions) and likely harder to fix.

Let's handle the first issue in this bug and file another bug for the second.

Comment 9 Albert Esteve 2022-05-02 13:47:56 UTC
As stated above, this bug handles the issue of the unrelated warning that pops up when looking up storage domains (i.e., "2022-01-31 14:14:51,417+0100 WARN  (monitor/028b34b) [storage.LVM] All 1 tries have failed: ...");
the second issue is tracked by the upstream issue https://github.com/oVirt/vdsm/issues/158.

Comment 10 Shir Fishbain 2022-05-30 20:01:39 UTC
QE doesn't have the capacity to verify this bug during 4.5.1. The team focuses on CBT tasks.

Comment 13 sshmulev 2022-06-19 10:13:30 UTC
Hi Albert,
Can you please provide here the steps to reproduce this error?

Comment 14 Albert Esteve 2022-06-20 11:12:04 UTC
Hi Sophie,
Note that the original issue is not addressed in the linked PR, as it has been split into an upstream issue. That one, we don't know how to reproduce it yet.

Comment 15 sshmulev 2022-06-21 07:15:11 UTC
Verified.

Could reproduce it on vdsm-4.50.0.13 with an automation TC that triggered this WARN message "All 1 tries have failed: LVM command failed".
On the fixed one we don't see those warnings anymore.

Versions:
engine-4.5.1.2-0.11.el8ev
vdsm-4.50.1.3-1.el8ev

Comment 19 errata-xmlrpc 2022-07-14 12:41:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.5.1] update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5583

Comment 20 meital avital 2022-07-20 08:52:35 UTC
Due to QE capacity constraints, we are not going to cover this issue in our automation.

