Bug 1345787 - Logging: useless alert on engine side: command failed: Cannot find master domain: ... (followed with UUIDs)
Summary: Logging: useless alert on engine side: command failed: Cannot find master dom...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.0.0
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ovirt-4.1.0-beta
Target Release: 4.1.0.2
Assignee: Idan Shaby
QA Contact: Lilach Zitnitski
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-06-13 07:55 UTC by Yaniv Kaul
Modified: 2017-02-01 14:43 UTC
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-01 14:43:21 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: planning_ack+
rule-engine: devel_ack+
ratamir: testing_ack+


Attachments
engine.log (868.50 KB, application/x-gzip)
2016-06-13 07:55 UTC, Yaniv Kaul
no flags Details
vdsm log from host1 (442.64 KB, application/x-gzip)
2016-06-13 07:57 UTC, Yaniv Kaul
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1344314 0 high CLOSED "VDSM HOST2 command failed: Cannot find master domain" after adding storage 2021-02-22 00:41:40 UTC
oVirt gerrit 70736 0 master MERGED backend: log VDS command's name on failure 2017-01-18 13:50:44 UTC
oVirt gerrit 70800 0 ovirt-engine-4.1 MERGED backend: log VDS command's name on failure 2017-01-19 08:58:34 UTC

Internal Links: 1344314

Description Yaniv Kaul 2016-06-13 07:55:54 UTC
Created attachment 1167255 [details]
engine.log

Description of problem:
Regardless of the specific issue, the following alert that I see on the engine side is useless:
VDSM lago_basic_suite_master_host0 command failed: Cannot find master domain: u'spUUID=15b7bb93-24d3-4d9e-bd1e-a2aeff8bf15e, msdUUID=fb92aed6-9deb-42a2-8215-cf91e2b1bba2'

1. Which command failed?
2. What are those UUIDs?
3. What should the user do?

The error message needs to be vastly improved.
I'll attach the VDSM and engine logs, but I'm really looking for an improvement to the logging, not an analysis of the exact cause here.

Note that all SDs are up, all hosts are up, all DCs are up, etc.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20160612231308.gitff046da.el7.centos.noarch

How reproducible:


Steps to Reproduce:
1. Running ovirt-system-tests - which all seem to succeed.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yaniv Kaul 2016-06-13 07:57:35 UTC
Created attachment 1167256 [details]
vdsm log from host1

Comment 2 Yaniv Kaul 2016-06-13 08:02:08 UTC
I believe the relevant part in VDSM is:
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,772::__init__::519::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'LVMVolumeGroup.create' in bridge with {u'devlist': [u'36001405fbc850aad396424caec568960', u'3600140529e6146bf72248c7949c91a2d'], u'force': False, u'name': u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2'}
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,773::task::597::Storage.TaskManager.Task::(_updateState) Task=`fc4eb275-834d-4dc0-9509-c0c381dc5257`::moving from state init -> state preparing
jsonrpc.Executor/7::INFO::2016-06-13 03:19:51,773::logUtils::49::dispatcher::(wrapper) Run and protect: createVG(vgname=u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2', devlist=[u'36001405fbc850aad396424caec568960', u'3600140529e6146bf72248c7949c91a2d'], force=False, options=None)
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,774::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /usr/sbin/lvm pvcreate --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600140529e6146bf72248c7949c91a2d|/dev/mapper/36001405fbc850aad396424caec568960|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --metadatasize 128m --metadatacopies 2 --metadataignore y /dev/mapper/36001405fbc850aad396424caec568960 /dev/mapper/3600140529e6146bf72248c7949c91a2d (cwd None)
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,832::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /usr/sbin/lvm pvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600140529e6146bf72248c7949c91a2d|/dev/mapper/36001405fbc850aad396424caec568960|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --metadataignore n /dev/mapper/36001405fbc850aad396424caec568960 (cwd None)
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,868::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /usr/sbin/lvm vgcreate --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600140529e6146bf72248c7949c91a2d|/dev/mapper/36001405fbc850aad396424caec568960|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --physicalextentsize 128m --addtag RHAT_storage_domain_UNREADY fb92aed6-9deb-42a2-8215-cf91e2b1bba2 /dev/mapper/36001405fbc850aad396424caec568960 /dev/mapper/3600140529e6146bf72248c7949c91a2d (cwd None)
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,914::lvm::952::Storage.LVM::(createVG) Cache after createvg {u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2': Stub(name=u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2', stale=True)}
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,914::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600140529e6146bf72248c7949c91a2d|/dev/mapper/36001405358c3e2d9ded4394bdd32e34b|/dev/mapper/3600140569bb822ef44147198fb1bc15d|/dev/mapper/36001405cbcd243c08b6432587d674e5a|/dev/mapper/36001405fbc850aad396424caec568960|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name fb92aed6-9deb-42a2-8215-cf91e2b1bba2 (cwd None)
jsonrpc.Executor/7::INFO::2016-06-13 03:19:51,936::logUtils::52::dispatcher::(wrapper) Run and protect: createVG, Return response: {'uuid': 'FheYK6-0yN3-CA0J-QhGa-Ya20-sz1t-tbzDyK'}
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,936::task::1193::Storage.TaskManager.Task::(prepare) Task=`fc4eb275-834d-4dc0-9509-c0c381dc5257`::finished: {'uuid': 'FheYK6-0yN3-CA0J-QhGa-Ya20-sz1t-tbzDyK'}
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,936::task::597::Storage.TaskManager.Task::(_updateState) Task=`fc4eb275-834d-4dc0-9509-c0c381dc5257`::moving from state preparing -> state finished
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,936::resourceManager::952::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,936::resourceManager::989::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,936::task::995::Storage.TaskManager.Task::(_decref) Task=`fc4eb275-834d-4dc0-9509-c0c381dc5257`::ref 0 aborting False
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,936::__init__::548::jsonrpc.JsonRpcServer::(_serveRequest) Return 'LVMVolumeGroup.create' in bridge with FheYK6-0yN3-CA0J-QhGa-Ya20-sz1t-tbzDyK
jsonrpc.Executor/4::DEBUG::2016-06-13 03:19:52,941::__init__::519::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StorageDomain.create' in bridge with {u'name': u'iscsi', u'domainType': 3, u'domainClass': 1, u'typeArgs': u'FheYK6-0yN3-CA0J-QhGa-Ya20-sz1t-tbzDyK', u'version': u'3', u'storagedomainID': u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2'}
jsonrpc.Executor/4::DEBUG::2016-06-13 03:19:52,942::task::597::Storage.TaskManager.Task::(_updateState) Task=`0f7d4893-db11-404c-bd49-bed3c1f5a096`::moving from state init -> state preparing
jsonrpc.Executor/4::INFO::2016-06-13 03:19:52,942::logUtils::49::dispatcher::(wrapper) Run and protect: createStorageDomain(storageType=3, sdUUID=u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2', domainName=u'iscsi', typeSpecificArg=u'FheYK6-0yN3-CA0J-QhGa-Ya20-sz1t-tbzDyK', domClass=1, domVersion=u'3', options=None)
jsonrpc.Executor/4::ERROR::2016-06-13 03:19:52,943::sdc::140::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain fb92aed6-9deb-42a2-8215-cf91e2b1bba2
jsonrpc.Executor/4::ERROR::2016-06-13 03:19:52,943::sdc::157::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain fb92aed6-9deb-42a2-8215-cf91e2b1bba2
jsonrpc.Executor/4::ERROR::2016-06-13 03:19:52,943::sdc::146::Storage.StorageDomainCache::(_findDomain) domain fb92aed6-9deb-42a2-8215-cf91e2b1bba2 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 144, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 174, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2',)


I wonder if this happens upon creation of the SD?
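
For context, the StorageDomainDoesNotExist traceback above follows a simple lookup-and-raise pattern in the storage domain cache. Below is a minimal Python sketch of that pattern - a simplified illustration only, not the actual VDSM source; the class and method names mirror the traceback, everything else is hypothetical:

    # Simplified sketch of the lookup-and-raise pattern seen in the traceback
    # above; not the actual VDSM implementation.

    class StorageDomainDoesNotExist(Exception):
        def __init__(self, sdUUID):
            super(StorageDomainDoesNotExist, self).__init__(
                "Storage domain does not exist: (%r,)" % sdUUID)

    class StorageDomainCache(object):
        def __init__(self, known_domains):
            # hypothetical map of domains the cache already knows about
            self._domains = known_domains

        def _findUnfetchedDomain(self, sdUUID):
            # stand-in for scanning the attached storage for the domain
            if sdUUID not in self._domains:
                raise StorageDomainDoesNotExist(sdUUID)
            return self._domains[sdUUID]

        def _findDomain(self, sdUUID):
            # in the real code this logs "looking for unfetched domain ..."
            # and lets the exception propagate on failure
            return self._findUnfetchedDomain(sdUUID)

    if __name__ == "__main__":
        cache = StorageDomainCache(known_domains={})
        try:
            cache._findDomain("fb92aed6-9deb-42a2-8215-cf91e2b1bba2")
        except StorageDomainDoesNotExist as e:
            print(e)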

Comment 3 Yaniv Kaul 2016-06-13 09:20:25 UTC
I assume this is a dup of bug 1344314 - but this one is about the useless log message, while bug 1344314 is about the underlying issue.

Comment 4 Maor 2016-06-13 09:37:27 UTC
Indeed, it looks like a dup.

Create storage pool was executed on lago_basic_suite_master_host1:
2016-06-13 03:20:03,822 INFO CreateStoragePoolVDSCommand START, CreateStoragePoolVDSCommand(HostName = lago_basic_suite_master_host1....

Connect Storage pool was executed on lago_basic_suite_master_host0:
2016-06-13 03:20:29,905 INFO ConnectStoragePoolVDSCommand] START, ConnectStoragePoolVDSCommand(HostName = lago_basic_suite_master_host0....

and the error:
[org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (default task-23) [40529d3c] Exception: org.ovirt.engine.core.vdsbroker.irsbroker.IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: u'spUUID=15b7bb93-24d3-4d9e-bd1e-a2aeff8bf15e, msdUUID=fb92aed6-9deb-42a2-8215-cf91e2b1bba2'

Comment 5 Idan Shaby 2017-01-18 12:15:34 UTC
Indeed, the command's name is missing - patch [1] adds the failing command's name to the log.

Regarding the UUIDs - the log message is composed of the engine-side prefix, "VDSM ${VdsName} command ${CommandName} failed: ", followed by the error message returned from vdsm.
The UUIDs are part of the generic output of a failing vdsm command.
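
To illustrate the composition (the engine code itself is Java; this is only a small Python sketch with hypothetical names, not the actual implementation):

    def compose_audit_message(vds_name, command_name, vdsm_error):
        # engine-side prefix "VDSM ${VdsName} command ${CommandName} failed: "
        # followed by the raw error message returned by vdsm
        return "VDSM %s command %s failed: %s" % (vds_name, command_name, vdsm_error)

    vdsm_error = ("Cannot find master domain: "
                  "u'spUUID=15b7bb93-24d3-4d9e-bd1e-a2aeff8bf15e, "
                  "msdUUID=fb92aed6-9deb-42a2-8215-cf91e2b1bba2'")
    print(compose_audit_message("lago_basic_suite_master_host0",
                                "ConnectStoragePoolVDS", vdsm_error))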

Regarding recommendations for the user - since this log message belongs to a generic failing vdsm command that runs as part of a wrapping engine command, it does not contain recommended steps for the user, because the same failure can occur in many different engine flows. Such advice can be found in the engine commands' logs.

[1] https://gerrit.ovirt.org/#/c/70736

Comment 6 Raz Tamir 2017-01-19 09:39:47 UTC
Do we have clear steps to reproduce?

Comment 7 Idan Shaby 2017-01-19 11:19:23 UTC
Sure, sorry for not mentioning them.
You can get these audit logs in many scenarios.

This is what I did:
1. Have a master NFS domain.
2. Switch off the nfs service on the storage server (or block the connection using other methods).
3. Wait about a minute and watch the audit log messages.
You will see messages in two formats:
- VDSM command <new_command_name> failed: <message_from_vdsm>
- VDSM <host_name> command <new_command_name> failed: <message_from_vdsm>

Previously the log did not contain the command name, so what you need to verify is that both message formats include the <new_command_name>.
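
For verification purposes, a small Python sketch like the following (my illustration only, not part of the patch) can check that lines in both formats carry a command name:

    import re

    patterns = [
        # VDSM command <new_command_name> failed: <message_from_vdsm>
        re.compile(r"^VDSM command (?P<command>\S+) failed: (?P<message>.+)$"),
        # VDSM <host_name> command <new_command_name> failed: <message_from_vdsm>
        re.compile(r"^VDSM (?P<host>\S+) command (?P<command>\S+) failed: (?P<message>.+)$"),
    ]

    samples = [
        # hypothetical examples in the two expected formats
        "VDSM command ConnectStoragePoolVDS failed: Cannot find master domain: ...",
        "VDSM lago_basic_suite_master_host0 command ConnectStoragePoolVDS failed: Cannot find master domain: ...",
    ]

    for line in samples:
        for pattern in patterns:
            match = pattern.match(line)
            if match:
                assert match.group("command"), "command name missing: %s" % line
                print("OK: %s" % match.group("command"))
                break
        else:
            print("no match: %s" % line)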

Comment 8 Lilach Zitnitski 2017-01-25 14:35:59 UTC
----------------------------------------
Tested with the following code:
----------------------------------------
ovirt-engine-4.1.0.3-0.0.master.20170123115434.git8a69605.el7.centos.noarch
vdsm-4.19.2-1.el7.centos.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Have a master NFS domain.
2. Block connection from all hosts to the nfs server.

Actual results:

Both formats appear with the new command name.

VDSM command GetStoragePoolInfoVDS failed: Heartbeat exceeded

VDSM blond-vdsh command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=00000001-0001-0001-0001-000000000311, msdUUID=8af4e337-976e-4415-aa2b-4bf95fef9908'

Expected results:

Moving to VERIFIED!

