Created attachment 1167255 [details]
engine.log

Description of problem:

Regardless of the specific issue, the following alert that I see in Engine is useless:

VDSM lago_basic_suite_master_host0 command failed: Cannot find master domain: u'spUUID=15b7bb93-24d3-4d9e-bd1e-a2aeff8bf15e, msdUUID=fb92aed6-9deb-42a2-8215-cf91e2b1bba2'

1. Which command failed?
2. What are those UUIDs?
3. What should the user do?

We need to vastly improve this error message. I'll attach the VDSM and engine logs, but I'm looking for an improvement to the logs here, not the exact cause. Note that all SDs are up, all hosts are up, all DCs are up, etc.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20160612231308.gitff046da.el7.centos.noarch

How reproducible:

Steps to Reproduce:
1. Running ovirt-system-tests - which all seem to succeed.
2.
3.

Actual results:

Expected results:

Additional info:
Created attachment 1167256 [details]
vdsm log from host1
I believe the relevant part in VDSM is:

jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,772::__init__::519::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'LVMVolumeGroup.create' in bridge with {u'devlist': [u'36001405fbc850aad396424caec568960', u'3600140529e6146bf72248c7949c91a2d'], u'force': False, u'name': u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2'}
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,773::task::597::Storage.TaskManager.Task::(_updateState) Task=`fc4eb275-834d-4dc0-9509-c0c381dc5257`::moving from state init -> state preparing
jsonrpc.Executor/7::INFO::2016-06-13 03:19:51,773::logUtils::49::dispatcher::(wrapper) Run and protect: createVG(vgname=u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2', devlist=[u'36001405fbc850aad396424caec568960', u'3600140529e6146bf72248c7949c91a2d'], force=False, options=None)
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,774::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /usr/sbin/lvm pvcreate --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600140529e6146bf72248c7949c91a2d|/dev/mapper/36001405fbc850aad396424caec568960|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --metadatasize 128m --metadatacopies 2 --metadataignore y /dev/mapper/36001405fbc850aad396424caec568960 /dev/mapper/3600140529e6146bf72248c7949c91a2d (cwd None)
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,832::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /usr/sbin/lvm pvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600140529e6146bf72248c7949c91a2d|/dev/mapper/36001405fbc850aad396424caec568960|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --metadataignore n /dev/mapper/36001405fbc850aad396424caec568960 (cwd None)
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,868::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /usr/sbin/lvm vgcreate --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600140529e6146bf72248c7949c91a2d|/dev/mapper/36001405fbc850aad396424caec568960|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --physicalextentsize 128m --addtag RHAT_storage_domain_UNREADY fb92aed6-9deb-42a2-8215-cf91e2b1bba2 /dev/mapper/36001405fbc850aad396424caec568960 /dev/mapper/3600140529e6146bf72248c7949c91a2d (cwd None)
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,914::lvm::952::Storage.LVM::(createVG) Cache after createvg {u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2': Stub(name=u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2', stale=True)}
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,914::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600140529e6146bf72248c7949c91a2d|/dev/mapper/36001405358c3e2d9ded4394bdd32e34b|/dev/mapper/3600140569bb822ef44147198fb1bc15d|/dev/mapper/36001405cbcd243c08b6432587d674e5a|/dev/mapper/36001405fbc850aad396424caec568960|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name fb92aed6-9deb-42a2-8215-cf91e2b1bba2 (cwd None)
jsonrpc.Executor/7::INFO::2016-06-13 03:19:51,936::logUtils::52::dispatcher::(wrapper) Run and protect: createVG, Return response: {'uuid': 'FheYK6-0yN3-CA0J-QhGa-Ya20-sz1t-tbzDyK'}
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,936::task::1193::Storage.TaskManager.Task::(prepare) Task=`fc4eb275-834d-4dc0-9509-c0c381dc5257`::finished: {'uuid': 'FheYK6-0yN3-CA0J-QhGa-Ya20-sz1t-tbzDyK'}
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,936::task::597::Storage.TaskManager.Task::(_updateState) Task=`fc4eb275-834d-4dc0-9509-c0c381dc5257`::moving from state preparing -> state finished
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,936::resourceManager::952::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,936::resourceManager::989::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,936::task::995::Storage.TaskManager.Task::(_decref) Task=`fc4eb275-834d-4dc0-9509-c0c381dc5257`::ref 0 aborting False
jsonrpc.Executor/7::DEBUG::2016-06-13 03:19:51,936::__init__::548::jsonrpc.JsonRpcServer::(_serveRequest) Return 'LVMVolumeGroup.create' in bridge with FheYK6-0yN3-CA0J-QhGa-Ya20-sz1t-tbzDyK
jsonrpc.Executor/4::DEBUG::2016-06-13 03:19:52,941::__init__::519::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StorageDomain.create' in bridge with {u'name': u'iscsi', u'domainType': 3, u'domainClass': 1, u'typeArgs': u'FheYK6-0yN3-CA0J-QhGa-Ya20-sz1t-tbzDyK', u'version': u'3', u'storagedomainID': u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2'}
jsonrpc.Executor/4::DEBUG::2016-06-13 03:19:52,942::task::597::Storage.TaskManager.Task::(_updateState) Task=`0f7d4893-db11-404c-bd49-bed3c1f5a096`::moving from state init -> state preparing
jsonrpc.Executor/4::INFO::2016-06-13 03:19:52,942::logUtils::49::dispatcher::(wrapper) Run and protect: createStorageDomain(storageType=3, sdUUID=u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2', domainName=u'iscsi', typeSpecificArg=u'FheYK6-0yN3-CA0J-QhGa-Ya20-sz1t-tbzDyK', domClass=1, domVersion=u'3', options=None)
jsonrpc.Executor/4::ERROR::2016-06-13 03:19:52,943::sdc::140::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain fb92aed6-9deb-42a2-8215-cf91e2b1bba2
jsonrpc.Executor/4::ERROR::2016-06-13 03:19:52,943::sdc::157::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain fb92aed6-9deb-42a2-8215-cf91e2b1bba2
jsonrpc.Executor/4::ERROR::2016-06-13 03:19:52,943::sdc::146::Storage.StorageDomainCache::(_findDomain) domain fb92aed6-9deb-42a2-8215-cf91e2b1bba2 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 144, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 174, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'fb92aed6-9deb-42a2-8215-cf91e2b1bba2',)

I wonder if it happens upon creation of SD?
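The traceback above corresponds to a cache-miss-then-fetch pattern along these lines. This is a simplified illustrative sketch, not vdsm's actual sdc.py code; the class and method names are borrowed from the log for readability:

```python
class StorageDomainDoesNotExist(Exception):
    # Simplified stand-in for vdsm's se.StorageDomainDoesNotExist.
    def __init__(self, sdUUID):
        super().__init__("Storage domain does not exist: (%r,)" % (sdUUID,))


class StorageDomainCache:
    """Sketch of the behavior seen in the log: the lookup first misses the
    in-memory cache, then falls through to a fetch of the "unfetched"
    domain, which raises when the domain cannot be found on storage."""

    def __init__(self):
        self._cache = {}

    def _find_unfetched_domain(self, sdUUID):
        # In real vdsm this scans the connected storage for the domain;
        # here we assume it is simply absent, as in the bug's traceback.
        raise StorageDomainDoesNotExist(sdUUID)

    def produce(self, sdUUID):
        dom = self._cache.get(sdUUID)
        if dom is None:
            dom = self._find_unfetched_domain(sdUUID)
        return dom
```

A freshly created domain may not be in the cache yet, which is consistent with the "looking for unfetched domain" messages appearing right after StorageDomain.create.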
I assume this is a dup of bug 1344314 - but this one is about the useless log, not about the underlying issue, and 1344314 is about the underlying issue.
Indeed looks like a dup.

Create storage pool was executed on lago_basic_suite_master_host1:

2016-06-13 03:20:03,822 INFO CreateStoragePoolVDSCommand START, CreateStoragePoolVDSCommand(HostName = lago_basic_suite_master_host1....

Connect Storage pool was executed on lago_basic_suite_master_host0:

2016-06-13 03:20:29,905 INFO ConnectStoragePoolVDSCommand] START, ConnectStoragePoolVDSCommand(HostName = lago_basic_suite_master_host0....

and the error:

[org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (default task-23) [40529d3c] Exception: org.ovirt.engine.core.vdsbroker.irsbroker.IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: u'spUUID=15b7bb93-24d3-4d9e-bd1e-a2aeff8bf15e, msdUUID=fb92aed6-9deb-42a2-8215-cf91e2b1bba2'
Indeed, the command's name is missing - patch [1] adds the failing command's name to the log.

Regarding the UUIDs: the log line is composed of the engine-side prefix, "VDSM ${VdsName} command ${CommandName} failed: ", followed by the error message from vdsm. The UUIDs are part of the generic output of the failing vdsm command.

Regarding recommendations for the user: since this log line belongs to a generic failing vdsm command that is part of a wrapping engine command, you will not find recommended user steps here, because it can be part of many different engine flows. Such advice can be found in the engine commands' logs.

[1] https://gerrit.ovirt.org/#/c/70736
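The message composition described above can be sketched as follows. This is an illustrative Python recreation, not the engine's actual (Java) code; the function name is hypothetical:

```python
def format_audit_message(vds_name, command_name, vdsm_error):
    # Mimics the engine-side prefix "VDSM ${VdsName} command ${CommandName} failed: "
    # followed by the raw error text from vdsm. Before the fix, the command
    # name was missing from the prefix, which is what made the alert useless.
    return "VDSM {} command {} failed: {}".format(vds_name, command_name, vdsm_error)


# Example with the values from this bug:
msg = format_audit_message(
    "lago_basic_suite_master_host0",
    "ConnectStoragePoolVDS",
    "Cannot find master domain: u'spUUID=15b7bb93-24d3-4d9e-bd1e-a2aeff8bf15e, "
    "msdUUID=fb92aed6-9deb-42a2-8215-cf91e2b1bba2'",
)
```

The vdsm error text (including the UUIDs) is appended as-is, which is why no per-flow user guidance can appear in this generic message.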
Do we have clear steps to reproduce?
Sure, sorry for not mentioning. You can get these audit logs in many scenarios. This is what I did:

1. Have a master NFS domain.
2. Switch off the nfs service on the storage server (or block the connection using other methods).
3. Wait about a minute and watch the audit log messages.

You will see messages in two formats:

- VDSM command <new_command_name> failed: <message_from_vdsm>
- VDSM <host_name> command <new_command_name> failed: <message_from_vdsm>

Previously the log did not contain the command name, so what you need to make sure is that both messages include this <new_command_name>.
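The check above can be automated with a small regex helper. A sketch, assuming the two message formats exactly as written above (the helper name is hypothetical, not part of any oVirt tooling):

```python
import re

# The two expected audit-log formats, with the command name captured.
# The host-less format is tried first so that "command" is not mistaken
# for a host name in the second pattern.
PATTERNS = [
    re.compile(r"^VDSM command (?P<command>\S+) failed: .+"),
    re.compile(r"^VDSM (?P<host>\S+) command (?P<command>\S+) failed: .+"),
]


def extract_command_name(line):
    """Return the command name if the line matches either format, else None."""
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            return match.group("command")
    return None
```

A line in the old format ("VDSM <host_name> command failed: ...") matches neither pattern, so a None result flags an audit message that is still missing the command name.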
--------------------------------------
Tested with the following code:
----------------------------------------
ovirt-engine-4.1.0.3-0.0.master.20170123115434.git8a69605.el7.centos.noarch
vdsm-4.19.2-1.el7.centos.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Have a master NFS domain.
2. Block connection from all hosts to the nfs server.

Actual results:
Both formats appear with the new command name:

VDSM command GetStoragePoolInfoVDS failed: Heartbeat exceeded
VDSM blond-vdsh command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=00000001-0001-0001-0001-000000000311, msdUUID=8af4e337-976e-4415-aa2b-4bf95fef9908'

Expected results:

Moving to VERIFIED!