Bug 782864 - ovirt-engine-core: creating master domain from several targets when there is more than one host will cause non-spm host to become non-operational
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: lpeer
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-01-18 17:49 UTC by Dafna Ron
Modified: 2012-01-24 09:16 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-01-24 09:16:38 UTC
oVirt Team: ---
Embargoed:


Attachments
logs (371.77 KB, application/x-gzip)
2012-01-18 17:49 UTC, Dafna Ron
new logs (1.17 MB, application/x-gzip)
2012-01-23 17:55 UTC, Dafna Ron

Description Dafna Ron 2012-01-18 17:49:37 UTC
Created attachment 556086 [details]
logs

Description of problem:

Adding a first domain (the master domain) constructed from several LUNs causes the HSM host to become non-operational, since the backend sends ConnectStorageServer for a target that is listed in multipath but that we did not ask to add to the domain:

Thread-434::INFO::2012-01-18 17:47:41,619::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='c7256615-d78f-4e03-b8df-67cd8617068a', hostID=2, scsiKey='c7256615-d78f-4e03-b8df-67cd8617068a', msdUUID='3b5eeffd-9471-4b80-ba0c-234f29acc6fe', masterVersion=1, options=None)
Thread-434::DEBUG::2012-01-18 17:47:41,620::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.c7256615-d78f-4e03-b8df-67cd8617068a`ReqID=`2edfaef0-61c0-4dd8-aa2b-87b57a9aac47`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '747' at '_connectStoragePool'


Version-Release number of selected component (if applicable):

vdsm-4.9.3.1-0.fc16.x86_64
device-mapper-multipath-0.4.9-19.fc16.x86_64
ovirt-engine-backend-3.0.0_0001-1.2.fc16.x86_64

How reproducible:

Steps to Reproduce:
1. Try to create a master domain from several LUNs on iSCSI storage (you need more than one host in the cluster).

Actual results:

The HSM host becomes non-operational since it cannot connect to the pool.

Expected results:

We should not be asking to connect to a target that the user did not ask to connect to.

Additional info: full engine and vdsm logs are attached.

Thread-434::ERROR::2012-01-18 17:47:44,683::sdc::113::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `3b5eeffd-9471-4b80-ba0c-234f29acc6fe`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 109, in _findDomain
    return mod.findDomain(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1035, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 259, in __init__
    lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))
  File "/usr/share/vdsm/storage/lvm.py", line 911, in checkVGBlockSizes
    _checkpvsblksize(pvs, vgBlkSize)
  File "/usr/share/vdsm/storage/lvm.py", line 896, in _checkpvsblksize
    pvBlkSize = _getpvblksize(pv)
  File "/usr/share/vdsm/storage/lvm.py", line 891, in _getpvblksize
    dev = devicemapper.getDmId(os.path.basename(pv))
  File "/usr/share/vdsm/storage/devicemapper.py", line 37, in getDmId
    raise OSError(errno.ENODEV, "Could not find dm device named `%s`" % deviceMultipathName)
OSError: [Errno 19] Could not find dm device named `unknown device`
Thread-434::DEBUG::2012-01-18 17:47:44,686::resourceManager::535::ResourceManager::(releaseResource) Trying to release resource 'Storage.c7256615-d78f-4e03-b8df-67cd8617068a'
Thread-434::DEBUG::2012-01-18 17:47:44,688::resourceManager::550::ResourceManager::(releaseResource) Released resource 'Storage.c7256615-d78f-4e03-b8df-67cd8617068a' (0 active users)
Thread-434::DEBUG::2012-01-18 17:47:44,689::resourceManager::555::ResourceManager::(releaseResource) Resource 'Storage.c7256615-d78f-4e03-b8df-67cd8617068a' is free, finding out if anyone is waiting for it.
Thread-434::DEBUG::2012-01-18 17:47:44,689::resourceManager::562::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.c7256615-d78f-4e03-b8df-67cd8617068a', Clearing records.
Thread-434::ERROR::2012-01-18 17:47:44,691::task::855::TaskManager.Task::(_setError) Task=`e0d97593-5c8f-45a0-a4b1-d219164a1584`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 863, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 721, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 763, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 624, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1097, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1435, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=c7256615-d78f-4e03-b8df-67cd8617068a, msdUUID=3b5eeffd-9471-4b80-ba0c-234f29acc6fe'
Thread-434::DEBUG::2012-01-18 17:47:44,692::task::874::TaskManager.Task::(_run) Task=`e0d97593-5c8f-45a0-a4b1-d219164a1584`::Task._run: e0d97593-5c8f-45a0-a4b1-d219164a1584 ('c7256615-d78f-4e03-b8df-67cd8617068a', 2, 'c7256615-d78f-4e03-b8df-67cd8617068a', '3b5eeffd-9471-4b80-ba0c-234f29acc6fe', 1) {} failed - stopping task

Comment 1 Omer Frenkel 2012-01-19 15:46:35 UTC
I don't understand why you think this is a backend bug; there is a connect to the pool on one host, which fails:
2012-01-18 17:48:44,601 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (http--0.0.0.0-8080-2) START, ConnectStoragePoolVDSCommand(vdsId = e0c63588-41e4-11e1-b7ec-001a4a169741, storagePoolId = c7256615-d78f-4e03-b8df-67cd8617068a, vds_spm_id = 2, masterDomainId = 3b5eeffd-9471-4b80-ba0c-234f29acc6fe, masterVersion = 1), log id: 2e9762fb
2012-01-18 17:48:49,455 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (http--0.0.0.0-8080-2) Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         304
mMessage                      Cannot find master domain: 'spUUID=c7256615-d78f-4e03-b8df-67cd8617068a, msdUUID=3b5eeffd-9471-4b80-ba0c-234f29acc6fe'

right after it was created on another host:
2012-01-18 17:48:17,887 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (http--0.0.0.0-8080-2) START, CreateStoragePoolVDSCommand(vdsId = d5178fe8-41e4-11e1-a057-001a4a169741, storagePoolId=c7256615-d78f-4e03-b8df-67cd8617068a, storageType=ISCSI, storagePoolName=Blonds, masterDomainId=3b5eeffd-9471-4b80-ba0c-234f29acc6fe, domainsIdList=[3b5eeffd-9471-4b80-ba0c-234f29acc6fe], masterVersion=1), log id: e89efe6
2012-01-18 17:48:42,024 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (http--0.0.0.0-8080-2) FINISH, CreateStoragePoolVDSCommand, log id: e89efe6

On the first host it succeeded, though:
2012-01-18 17:48:49,543 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (http--0.0.0.0-8080-2) START, ConnectStoragePoolVDSCommand(vdsId = d5178fe8-41e4-11e1-a057-001a4a169741, storagePoolId = c7256615-d78f-4e03-b8df-67cd8617068a, vds_spm_id = 1, masterDomainId = 3b5eeffd-9471-4b80-ba0c-234f29acc6fe, masterVersion = 1), log id: 7a8f4a50
2012-01-18 17:48:51,099 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (http--0.0.0.0-8080-2) FINISH, ConnectStoragePoolVDSCommand, log id: 7a8f4a50

You can see the parameters to the connect are the same.
Of course, there was a connect to the storage server on both hosts before it all started (with the same connect parameters as well):
2012-01-18 17:48:42,116 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-5-thread-48) START, ConnectStorageServerVDSCommand(vdsId = e0c63588-41e4-11e1-b7ec-001a4a169741, storagePoolId = c7256615-d78f-4e03-b8df-67cd8617068a, storageType = ISCSI, connectionList = [{ id: 9444d223-8107-4267-8b39-613be47f6924, connection: 10.35.64.10 };{ id: e5bf1c79-c5f7-466e-b78d-69dcd3d9234b, connection: 10.35.64.10 };{ id: 803b1efd-c40a-419d-b557-21aefcf6323b, connection: 10.35.64.10 };]), log id: 1c899b48
2012-01-18 17:48:42,119 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-5-thread-47) START, ConnectStorageServerVDSCommand(vdsId = d5178fe8-41e4-11e1-a057-001a4a169741, storagePoolId = c7256615-d78f-4e03-b8df-67cd8617068a, storageType = ISCSI, connectionList = [{ id: 9444d223-8107-4267-8b39-613be47f6924, connection: 10.35.64.10 };{ id: e5bf1c79-c5f7-466e-b78d-69dcd3d9234b, connection: 10.35.64.10 };{ id: 803b1efd-c40a-419d-b557-21aefcf6323b, connection: 10.35.64.10 };]), log id: 48d99178
2012-01-18 17:48:43,050 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-5-thread-47) FINISH, ConnectStorageServerVDSCommand, return: {9444d223-8107-4267-8b39-613be47f6924=0, e5bf1c79-c5f7-466e-b78d-69dcd3d9234b=0, 803b1efd-c40a-419d-b557-21aefcf6323b=0}, log id: 48d99178
2012-01-18 17:48:44,522 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-5-thread-48) FINISH, ConnectStorageServerVDSCommand, return: {9444d223-8107-4267-8b39-613be47f6924=0, e5bf1c79-c5f7-466e-b78d-69dcd3d9234b=0, 803b1efd-c40a-419d-b557-21aefcf6323b=0}, log id: 1c899b48

Comment 2 Dafna Ron 2012-01-19 16:08:04 UTC
Danken - can you guys please answer Omer?

Comment 3 Dan Kenigsberg 2012-01-19 23:29:33 UTC
Dafna, does the failing host see the 3b5eeffd-9471-4b80-ba0c-234f29acc6fe vg (that's the suggested master domain)? Does it see the PVs that make it up?

Comment 4 Haim 2012-01-20 16:25:48 UTC
Dafna, I went over the logs, and I have some questions:
- In the SPM log, I don't see any connectStorage/iscsiadm/createStorage commands - they would be good for analysis and debugging.
- In the HSM log, I followed connectStorageServer and tried to run the iscsiadm commands manually on the server for comparison, correlating them with the log. After it runs the discovery command, it tries to log in to the following targets:

Dafna-15G 
Dafna-30G
Dafna-20G

The problem is that the storage server (10.35.64.10) doesn't expose/have any of those targets:

kiril_export
Daffi
DB1
Dafna-01
Dafna-02
Dafna-03
Dafna-04
Dafna-Main
MasterBackup
MasterBackup-01
iqn.1994-05.com.redhat:aqua-rhel1_0_922165947831233
iqn.1994-05.com.redhat:blue_0_94153343504339
iqn.2011-02.com.redhat.tlv.lab:yeylon_data-03
mgoldboi_qanashead-1
paikov1
paikov2
paikov3
paikov_tmp
rvaknin_test
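Haim's observation can be checked mechanically: none of the targets the failing host tries to log in to appear in the list the portal actually exposes, so every login attempt against them is doomed. A minimal sketch of that comparison (target names taken verbatim from the two lists above; plain set logic, no iscsiadm calls):

```python
# Targets the failing host tried to log in to, per the HSM log.
attempted = {"Dafna-15G", "Dafna-30G", "Dafna-20G"}

# Targets actually exposed by the storage server 10.35.64.10.
exposed = {
    "kiril_export", "Daffi", "DB1", "Dafna-01", "Dafna-02", "Dafna-03",
    "Dafna-04", "Dafna-Main", "MasterBackup", "MasterBackup-01",
    "iqn.1994-05.com.redhat:aqua-rhel1_0_922165947831233",
    "iqn.1994-05.com.redhat:blue_0_94153343504339",
    "iqn.2011-02.com.redhat.tlv.lab:yeylon_data-03",
    "mgoldboi_qanashead-1", "paikov1", "paikov2", "paikov3",
    "paikov_tmp", "rvaknin_test",
}

# Login attempts that can only fail: attempted but not exposed.
stale = attempted - exposed
print(sorted(stale))  # ['Dafna-15G', 'Dafna-20G', 'Dafna-30G']
```

All three attempted targets are stale, which already points away from the engine and toward stale state on the host itself.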

Comment 5 Dafna Ron 2012-01-23 16:44:05 UTC
After I had to reinstall the host, I can no longer reproduce this.
I am closing this for now and will reopen if it reproduces again.

Comment 6 Dafna Ron 2012-01-23 16:47:47 UTC
My mistake - reopening, since I was able to reproduce.

Comment 7 Dafna Ron 2012-01-23 17:51:41 UTC
After some more testing, I can see that some of the targets could not even be logged in to on the failing host.

host 1 (succeeded) 

[root@blond-vdsf tmp]# iscsiadm -m session
tcp: [154] 10.35.64.10:3260,1 Dafna-01
tcp: [155] 10.35.64.10:3260,1 Dafna-02
tcp: [156] 10.35.64.10:3260,1 Dafna-03
tcp: [157] 10.35.64.10:3260,1 Dafna-04
tcp: [158] 10.35.64.10:3260,1 Dafna-05


host 2 (failed) 

[root@blond-vdsg tmp]# iscsiadm -m session
tcp: [7] 10.35.64.10:3260,1 Dafna-01
tcp: [8] 10.35.64.10:3260,1 Dafna-04
tcp: [9] 10.35.64.10:3260,1 Dafna-03

Full logs will be attached again.
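The two session lists above can be diffed directly to name the missing targets. A quick sketch using the pasted `iscsiadm -m session` outputs (the hypothetical `targets` helper assumes the "tcp: [id] portal target" line format shown above, where the target name is the last field):

```python
# Diff the 'iscsiadm -m session' outputs of the two hosts to see
# which targets the failing host never logged in to.
def targets(session_output):
    # Each line looks like: "tcp: [154] 10.35.64.10:3260,1 Dafna-01";
    # the target name is the last whitespace-separated field.
    return {line.split()[-1] for line in session_output.strip().splitlines()}

host1_ok = """\
tcp: [154] 10.35.64.10:3260,1 Dafna-01
tcp: [155] 10.35.64.10:3260,1 Dafna-02
tcp: [156] 10.35.64.10:3260,1 Dafna-03
tcp: [157] 10.35.64.10:3260,1 Dafna-04
tcp: [158] 10.35.64.10:3260,1 Dafna-05
"""

host2_failed = """\
tcp: [7] 10.35.64.10:3260,1 Dafna-01
tcp: [8] 10.35.64.10:3260,1 Dafna-04
tcp: [9] 10.35.64.10:3260,1 Dafna-03
"""

missing = sorted(targets(host1_ok) - targets(host2_failed))
print(missing)  # ['Dafna-02', 'Dafna-05']
```

So the failing host is missing sessions to Dafna-02 and Dafna-05, even though the engine asked both hosts to connect with identical parameters.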

Comment 8 Dafna Ron 2012-01-23 17:55:31 UTC
Created attachment 557019 [details]
new logs

Comment 9 Dafna Ron 2012-01-24 09:16:38 UTC
Haim found the issue: this was a cache issue with iSCSI.
Looking at the file /var/lib/iscsi/nodes/Dafna-04/10.35.64.10,3260,1/default, node.startup was set to automatic and not manual.
I cleaned the cache under /var/lib/iscsi/nodes and the host was able to connect to the storage.
Closing this as not a bug.
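For anyone hitting the same symptom, the stale records can be spotted by scanning the open-iscsi node database for records whose node.startup is set to automatic. A hypothetical helper sketching that check (demonstrated against a throwaway directory mimicking the layout of /var/lib/iscsi/nodes, not the live one):

```python
import os
import tempfile

def nodes_with_automatic_startup(nodes_dir):
    """Walk an open-iscsi node database directory and return the record
    files where node.startup is 'automatic' (oVirt expects 'manual')."""
    hits = []
    for root, _dirs, files in os.walk(nodes_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path) as rec:
                for line in rec:
                    key, _, value = line.partition("=")
                    if key.strip() == "node.startup" and value.strip() == "automatic":
                        hits.append(path)
    return hits

# Demo against a temporary tree shaped like
# /var/lib/iscsi/nodes/<target>/<portal,port,tpgt>/default
tmp = tempfile.mkdtemp()
rec_dir = os.path.join(tmp, "Dafna-04", "10.35.64.10,3260,1")
os.makedirs(rec_dir)
with open(os.path.join(rec_dir, "default"), "w") as f:
    f.write("node.name = Dafna-04\nnode.startup = automatic\n")

print(nodes_with_automatic_startup(tmp))
```

On the affected host the actual fix was simply removing the cached records under /var/lib/iscsi/nodes and reconnecting, as described in comment 9.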

