Bug 1344314

Summary: "VDSM HOST2 command failed: Cannot find master domain" after adding storage
Product: Red Hat Enterprise Virtualization Manager Reporter: Dusan Fodor <dfodor>
Component: ovirt-engine Assignee: Maor <mlipchuk>
Status: CLOSED ERRATA QA Contact: Carlos Mestre González <cmestreg>
Severity: high Docs Contact:
Priority: high    
Version: 3.6.7 CC: acanan, amureini, cmestreg, dfediuck, dfodor, gklein, knarra, lsurette, mlipchuk, rbalakri, Rhev-m-bugs, sbonazzo, srevivo, ykaul, ylavi
Target Milestone: ovirt-4.0.1   
Target Release: 4.0.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1349404 1349405 (view as bug list) Environment:
Last Closed: 2016-08-23 20:41:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1258386, 1349404, 1349405    
Attachments:
Description                              Flags
screenshot of events and logfiles        none
correct version of vdsm log file         none
vdsm.log for both hosts and engine.      none

Description Dusan Fodor 2016-06-09 11:59:55 UTC
Created attachment 1166248 [details]
screenshot of events and logfiles

Description of problem:
The following error appears in the UI when configuring storage, although everything seems to work fine otherwise:
VDSM HOST2 command failed: Cannot find master domain: u'spUUID=83958f66-8809-4260-a13f-9ba2a4619f2a, msdUUID=6e1cf774-8caf-4f5d-9491-f01e618cf656'

Version-Release number of selected component (if applicable):
3.6.7-4

How reproducible:
Tried twice, same result

Steps to Reproduce:
1. Create a DC
2. Create 2 clusters
3. Create a host on each cluster
4. Attach NFS storage
5. See the error (a scripted equivalent of these steps is sketched below)
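
For reference, a minimal sketch of these steps using the oVirt Python SDK (ovirtsdk4), as it would be run against a 4.0 engine. The engine URL, credentials, host addresses and NFS export path below are placeholders, not values from this environment, and the waits for host installation are omitted:

# Minimal reproduction sketch; all URLs, credentials, addresses and paths are placeholders.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
system = connection.system_service()

# 1. Create the DC
dc = system.data_centers_service().add(
    types.DataCenter(name='test_dc', local=False))

# 2. Create two clusters in that DC
clusters = system.clusters_service()
for name in ('cluster_1', 'cluster_2'):
    clusters.add(types.Cluster(
        name=name,
        cpu=types.Cpu(architecture=types.Architecture.X86_64,
                      type='Intel Conroe Family'),
        data_center=types.DataCenter(id=dc.id),
    ))

# 3. Add a host to each cluster (a real run would wait for the hosts to become Up)
hosts = system.hosts_service()
for name, address, cluster in (('host_1', 'host1.example.com', 'cluster_1'),
                               ('host_2', 'host2.example.com', 'cluster_2')):
    hosts.add(types.Host(
        name=name,
        address=address,
        root_password='secret',
        cluster=types.Cluster(name=cluster),
    ))

# 4. Create an NFS data domain and attach it to the DC; the error in this
#    bug shows up during this step.
sd = system.storage_domains_service().add(types.StorageDomain(
    name='test_nfs',
    type=types.StorageDomainType.DATA,
    host=types.Host(name='host_1'),
    storage=types.HostStorage(
        type=types.StorageType.NFS,
        address='nfs.example.com',
        path='/export/test_nfs',
    ),
))
system.data_centers_service().data_center_service(dc.id) \
      .storage_domains_service().add(types.StorageDomain(id=sd.id))

connection.close()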


Actual results:
Storage is attached successfully, but the error message is displayed

Expected results:
The same, but without the error message being shown

Additional info:

Comment 1 Allon Mureinik 2016-06-09 12:36:12 UTC
This seems to happen during disk registration.
Maor, can you take a look please?

Comment 3 Maor 2016-06-09 13:53:50 UTC
Hi Dusan,

It looks like the logs are not synced.
In the engine, the error is at 2016-06-08 19:13:09,230
and the VDSM log only starts at 2016-06-09 08:01:01,906

Can you please attach all the relevant VDSM logs?

Also, which VDSM version are you using?

Thanks

Comment 4 Dusan Fodor 2016-06-09 15:24:01 UTC
Created attachment 1166350 [details]
correct version of vdsm log file

Hi,

Sorry, I didn't notice it had been rewritten.
The correct version (according to the timestamps) is attached.
The VDSM version is 4.17.31.

Thanks

Comment 5 Maor 2016-06-13 08:58:42 UTC
The issue here is that the uninitialized Data Center contained two Hosts.

It looks like createStoragePool was executed on Host1 while connectStoragePool was executed on Host2, which caused connectStoragePool to fail, since only Host1 knew about the master domain:

  2016-06-08 19:12:41,796 .... ConnectStorageServerVDSCommand] START, ConnectStorageServerVDSCommand(HostName = HOST1

  2016-06-08 19:12:43,001 INFO ...CreateStoragePoolVDSCommand] START, CreateStoragePoolVDSCommand(HostName = HOST1

  2016-06-08 19:13:07,791 INFO ....ConnectStoragePoolVDSCommand] START, ConnectStoragePoolVDSCommand(HostName = HOST2

When calling ConnectStoragePoolVDSCommand, the engine picks the Host via IrsProxyData#getPrioritizedVdsInPool. That method returns an arbitrary Host, which might not be the one that called CreateStoragePoolVDSCommand and updated its storage domain cache with the master SD.
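
For illustration, a schematic Python sketch of this race. This is not the actual engine code (which is Java, in IrsProxyData and the VDS command classes); the class and helper names below are simplified stand-ins:

# Schematic illustration only -- not real ovirt-engine code.
import random

class Host:
    def __init__(self, name):
        self.name = name
        self.known_domains = set()   # stands in for VDSM's storage domain cache on this host

def get_prioritized_vds_in_pool(hosts):
    # Stand-in for IrsProxyData#getPrioritizedVdsInPool:
    # any eligible host in the pool may be returned.
    return random.choice(hosts)

def attach_first_storage_domain(hosts, master_sd_uuid):
    creator = hosts[0]                         # CreateStoragePoolVDSCommand ran on HOST1,
    creator.known_domains.add(master_sd_uuid)  # so only HOST1 has the master SD cached.

    target = get_prioritized_vds_in_pool(hosts)   # may pick HOST2 instead
    if master_sd_uuid not in target.known_domains:
        # This is what surfaces in the UI as
        # "VDSM HOST2 command failed: Cannot find master domain"
        raise RuntimeError("Cannot find master domain: msdUUID=%s" % master_sd_uuid)

# Repeated runs fail whenever HOST2 happens to be chosen.
attach_first_storage_domain([Host("HOST1"), Host("HOST2")],
                            "6e1cf774-8caf-4f5d-9491-f01e618cf656")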

Comment 6 Sahina Bose 2016-06-20 12:05:27 UTC
*** Bug 1330827 has been marked as a duplicate of this bug. ***

Comment 8 Carlos Mestre González 2016-07-11 10:15:50 UTC
No error is shown in the events, but I checked the VDSM logs and I still see the error. Versions: vdsm-4.18.5.1-1.el7ev.x86_64, rhevm-4.0.2-0.2.rc1.el7ev.noarch

1. Created a DC and a cluster, connected two hosts to that cluster, and added an NFS domain; I see an error on one of the hosts:

ioprocess communication (23127)::INFO::2016-07-11 13:07:26,080::__init__::447::IOProcess::(_processLogs) Starting ioprocess
ioprocess communication (23127)::INFO::2016-07-11 13:07:26,080::__init__::447::IOProcess::(_processLogs) Starting ioprocess
jsonrpc.Executor/6::ERROR::2016-07-11 13:07:26,081::sdc::146::Storage.StorageDomainCache::(_findDomain) domain 058eedf5-e92b-47a8-a3b7-a5c5db6baac1 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 144, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 174, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('058eedf5-e92b-47a8-a3b7-a5c5db6baac1',)
jsonrpc.Executor/6::INFO::2016-07-11 13:07:26,082::nfsSD::70::Storage.StorageDomain::(create) sdUUID=058eedf5-e92b-47a8-a3b7-a5c5db6baac1 domainName=test_nfs remotePath=10.35.64.11:/vol/RHEV/Storage/storage_jenkins_ge19_nfs_3 domClass=1

Maor, is this important? The operation seems fine.

Comment 9 Maor 2016-07-11 12:27:04 UTC
I'm not sure whether it's related to this issue. Does it happen consistently?

Comment 10 Carlos Mestre González 2016-07-12 10:29:40 UTC
Created attachment 1178845 [details]
vdsm.log for both hosts and engine.

Yes, tested it twice and saw it both times.

A DC with a cluster and two hosts; added one NFS domain (choosing host_1), and the error shows up on host_1:

jsonrpc.Executor/3::ERROR::2016-07-12 13:22:40,918::sdc::146::Storage.StorageDomainCache::(_findDomain) domain d388f7ee-ac92-4ce8-9a7a-81d4e69adb16 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 144, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 174, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('d388f7ee-ac92-4ce8-9a7a-81d4e69adb16',)
jsonrpc.Executor/3::INFO::2016-07-12 13:22:40,920::nfsSD::70::Storage.StorageDomain::(create) sdUUID=d388f7ee-ac92-4ce8-9a7a-81d4e69adb16 domainName=nfs_test_dc remotePath=10.35.64.11:/vol/RHEV/Storage/storage_jenkins_ge19_nfs_4 domClass=1
jsonrpc.Executor/3::DEBUG::2016-07-12 13:22:40,928::outOfProcess::69::Storage.oop::(getProcessPool) Creating ioprocess d388f7ee-ac92-4ce8-9a7a-81d4e69adb16
jsonrpc.Executor/3::INFO::2016-07-12 13:22:40,928::__init__::325::IOProcessClient::(__init__) Starting client ioprocess-3
jsonrpc.Executor/3::DEBUG::2016-07-12 13:22:40,928::__init__::334::IOProcessClient::(_run) Starting ioprocess for client ioprocess-3

Comment 11 Allon Mureinik 2016-07-12 11:34:01 UTC
This is an old logging issue we've had since oVirt 3.2, IIRC. I agree it's ugly, but there's no real effect there.

You can move the BZ to VERIFIED, thanks!
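
For context on why the traceback is benign: the pattern is a cache probe that is expected to miss right before the domain is created, with the miss logged at ERROR level. A rough Python sketch of that idea follows; it is not the actual vdsm/storage/sdc.py code, and the names are simplified:

# Rough sketch of the lookup-before-create pattern -- NOT actual VDSM code.
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("Storage.StorageDomainCache")

class StorageDomainDoesNotExist(Exception):
    pass

_domain_cache = {}

def find_domain(sd_uuid):
    try:
        return _domain_cache[sd_uuid]
    except KeyError:
        # The miss is logged at ERROR level with a traceback, even though a
        # miss is the expected outcome for a domain that is about to be
        # created -- this is the "scary but harmless" entry in vdsm.log.
        log.exception("domain %s not found", sd_uuid)
        raise StorageDomainDoesNotExist(sd_uuid)

def create_domain(sd_uuid, name):
    try:
        find_domain(sd_uuid)           # probe the cache first: always misses here
        raise RuntimeError("domain already exists")
    except StorageDomainDoesNotExist:
        _domain_cache[sd_uuid] = name  # creation proceeds; the earlier ERROR had no effect
    return _domain_cache[sd_uuid]

create_domain("d388f7ee-ac92-4ce8-9a7a-81d4e69adb16", "nfs_test_dc")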

Comment 13 errata-xmlrpc 2016-08-23 20:41:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html