Created attachment 556086 [details]
logs

Description of problem:
Adding the first domain (the master domain), constructed from several LUNs, causes the HSM host to become non-operational, since the backend sends ConnectStorageServer for a target that is listed in multipath but that we did not ask to add to the domain:

Thread-434::INFO::2012-01-18 17:47:41,619::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='c7256615-d78f-4e03-b8df-67cd8617068a', hostID=2, scsiKey='c7256615-d78f-4e03-b8df-67cd8617068a', msdUUID='3b5eeffd-9471-4b80-ba0c-234f29acc6fe', masterVersion=1, options=None)
Thread-434::DEBUG::2012-01-18 17:47:41,620::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.c7256615-d78f-4e03-b8df-67cd8617068a`ReqID=`2edfaef0-61c0-4dd8-aa2b-87b57a9aac47`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '747' at '_connectStoragePool'

Version-Release number of selected component (if applicable):
vdsm-4.9.3.1-0.fc16.x86_64
device-mapper-multipath-0.4.9-19.fc16.x86_64
ovirt-engine-backend-3.0.0_0001-1.2.fc16.x86_64

How reproducible:

Steps to Reproduce:
1. Try to create a master domain from several LUNs on iSCSI storage (you need more than one host in the cluster).
2.
3.

Actual results:
The HSM host becomes non-operational, since it cannot connect to the pool.

Expected results:
We should not ask to connect to a target that the user did not ask to connect to.
Additional info:
Full engine and vdsm logs are attached.

Thread-434::ERROR::2012-01-18 17:47:44,683::sdc::113::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `3b5eeffd-9471-4b80-ba0c-234f29acc6fe`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 109, in _findDomain
    return mod.findDomain(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1035, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 259, in __init__
    lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))
  File "/usr/share/vdsm/storage/lvm.py", line 911, in checkVGBlockSizes
    _checkpvsblksize(pvs, vgBlkSize)
  File "/usr/share/vdsm/storage/lvm.py", line 896, in _checkpvsblksize
    pvBlkSize = _getpvblksize(pv)
  File "/usr/share/vdsm/storage/lvm.py", line 891, in _getpvblksize
    dev = devicemapper.getDmId(os.path.basename(pv))
  File "/usr/share/vdsm/storage/devicemapper.py", line 37, in getDmId
    raise OSError(errno.ENODEV, "Could not find dm device named `%s`" % deviceMultipathName)
OSError: [Errno 19] Could not find dm device named `unknown device`

Thread-434::DEBUG::2012-01-18 17:47:44,686::resourceManager::535::ResourceManager::(releaseResource) Trying to release resource 'Storage.c7256615-d78f-4e03-b8df-67cd8617068a'
Thread-434::DEBUG::2012-01-18 17:47:44,688::resourceManager::550::ResourceManager::(releaseResource) Released resource 'Storage.c7256615-d78f-4e03-b8df-67cd8617068a' (0 active users)
Thread-434::DEBUG::2012-01-18 17:47:44,689::resourceManager::555::ResourceManager::(releaseResource) Resource 'Storage.c7256615-d78f-4e03-b8df-67cd8617068a' is free, finding out if anyone is waiting for it.
Thread-434::DEBUG::2012-01-18 17:47:44,689::resourceManager::562::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.c7256615-d78f-4e03-b8df-67cd8617068a', Clearing records.
Thread-434::ERROR::2012-01-18 17:47:44,691::task::855::TaskManager.Task::(_setError) Task=`e0d97593-5c8f-45a0-a4b1-d219164a1584`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 863, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 721, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 763, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 624, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1097, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1435, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=c7256615-d78f-4e03-b8df-67cd8617068a, msdUUID=3b5eeffd-9471-4b80-ba0c-234f29acc6fe'

Thread-434::DEBUG::2012-01-18 17:47:44,692::task::874::TaskManager.Task::(_run) Task=`e0d97593-5c8f-45a0-a4b1-d219164a1584`::Task._run: e0d97593-5c8f-45a0-a4b1-d219164a1584 ('c7256615-d78f-4e03-b8df-67cd8617068a', 2, 'c7256615-d78f-4e03-b8df-67cd8617068a', '3b5eeffd-9471-4b80-ba0c-234f29acc6fe', 1) {} failed - stopping task
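One note on the traceback: `unknown device` is the placeholder string LVM reports for a PV whose backing device is missing, which is why getDmId ends up receiving it as a device name. A hedged Python sketch (not vdsm's actual code; the function name and the mapper-path resolution are illustrative) of guarding that lookup:

```python
import errno
import os

def get_dm_id_checked(pv_path):
    """Resolve a PV path to a device-mapper path, rejecting LVM's
    'unknown device' placeholder up front with a clearer error."""
    name = os.path.basename(pv_path)
    if name == "unknown device":
        # LVM could not find the PV's backing device on this host;
        # point the admin at iSCSI sessions / multipath instead of
        # failing deep inside the dm lookup.
        raise OSError(errno.ENODEV,
                      "PV has no visible device (LVM reports 'unknown device'); "
                      "check iSCSI sessions and multipath on this host")
    # Real code would look the name up via dmsetup / /dev/mapper here;
    # for this sketch we just build the conventional mapper path.
    return os.path.join("/dev/mapper", name)
```

This would turn the opaque `Could not find dm device named 'unknown device'` into an error that names the actual problem (a PV invisible to the host).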
I don't understand why you think this is a backend bug. There is a connect to the pool on one host, which fails:

2012-01-18 17:48:44,601 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (http--0.0.0.0-8080-2) START, ConnectStoragePoolVDSCommand(vdsId = e0c63588-41e4-11e1-b7ec-001a4a169741, storagePoolId = c7256615-d78f-4e03-b8df-67cd8617068a, vds_spm_id = 2, masterDomainId = 3b5eeffd-9471-4b80-ba0c-234f29acc6fe, masterVersion = 1), log id: 2e9762fb
2012-01-18 17:48:49,455 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (http--0.0.0.0-8080-2) Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value
Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus   Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode     304
mMessage  Cannot find master domain: 'spUUID=c7256615-d78f-4e03-b8df-67cd8617068a, msdUUID=3b5eeffd-9471-4b80-ba0c-234f29acc6fe'

right after it was created on another host:

2012-01-18 17:48:17,887 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (http--0.0.0.0-8080-2) START, CreateStoragePoolVDSCommand(vdsId = d5178fe8-41e4-11e1-a057-001a4a169741, storagePoolId=c7256615-d78f-4e03-b8df-67cd8617068a, storageType=ISCSI, storagePoolName=Blonds, masterDomainId=3b5eeffd-9471-4b80-ba0c-234f29acc6fe, domainsIdList=[3b5eeffd-9471-4b80-ba0c-234f29acc6fe], masterVersion=1), log id: e89efe6
2012-01-18 17:48:42,024 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStoragePoolVDSCommand] (http--0.0.0.0-8080-2) FINISH, CreateStoragePoolVDSCommand, log id: e89efe6

On the first host it succeeded, though:

2012-01-18 17:48:49,543 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (http--0.0.0.0-8080-2) START, ConnectStoragePoolVDSCommand(vdsId = d5178fe8-41e4-11e1-a057-001a4a169741, storagePoolId = c7256615-d78f-4e03-b8df-67cd8617068a, vds_spm_id = 1, masterDomainId = 3b5eeffd-9471-4b80-ba0c-234f29acc6fe, masterVersion = 1), log id: 7a8f4a50
2012-01-18 17:48:51,099 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (http--0.0.0.0-8080-2) FINISH, ConnectStoragePoolVDSCommand, log id: 7a8f4a50

You can see that the parameters to the connect are the same. Of course, there was a connect to the storage server on both hosts before it all started (with the same connect parameters as well):

2012-01-18 17:48:42,116 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-5-thread-48) START, ConnectStorageServerVDSCommand(vdsId = e0c63588-41e4-11e1-b7ec-001a4a169741, storagePoolId = c7256615-d78f-4e03-b8df-67cd8617068a, storageType = ISCSI, connectionList = [{ id: 9444d223-8107-4267-8b39-613be47f6924, connection: 10.35.64.10 };{ id: e5bf1c79-c5f7-466e-b78d-69dcd3d9234b, connection: 10.35.64.10 };{ id: 803b1efd-c40a-419d-b557-21aefcf6323b, connection: 10.35.64.10 };]), log id: 1c899b48
2012-01-18 17:48:42,119 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-5-thread-47) START, ConnectStorageServerVDSCommand(vdsId = d5178fe8-41e4-11e1-a057-001a4a169741, storagePoolId = c7256615-d78f-4e03-b8df-67cd8617068a, storageType = ISCSI, connectionList = [{ id: 9444d223-8107-4267-8b39-613be47f6924, connection: 10.35.64.10 };{ id: e5bf1c79-c5f7-466e-b78d-69dcd3d9234b, connection: 10.35.64.10 };{ id: 803b1efd-c40a-419d-b557-21aefcf6323b, connection: 10.35.64.10 };]), log id: 48d99178
2012-01-18 17:48:43,050 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-5-thread-47) FINISH, ConnectStorageServerVDSCommand, return: {9444d223-8107-4267-8b39-613be47f6924=0, e5bf1c79-c5f7-466e-b78d-69dcd3d9234b=0, 803b1efd-c40a-419d-b557-21aefcf6323b=0}, log id: 48d99178
2012-01-18 17:48:44,522 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-5-thread-48) FINISH, ConnectStorageServerVDSCommand, return: {9444d223-8107-4267-8b39-613be47f6924=0, e5bf1c79-c5f7-466e-b78d-69dcd3d9234b=0, 803b1efd-c40a-419d-b557-21aefcf6323b=0}, log id: 1c899b48
Danken - can you guys please answer Omer?
Dafna, does the failing host see the 3b5eeffd-9471-4b80-ba0c-234f29acc6fe vg (that's the suggested master domain)? Does it see the PVs that make it up?
Dafna, I went over the logs, and I have some questions:

- On the SPM log, I don't see any connectStorage/iscsiadm/createStorage commands - they would be good for analysis and debugging.
- On the HSM log, I followed connectStorageServer and ran the iscsiadm commands manually on the server for comparison, trying to correlate them with the log. After it runs the discovery command, it tries to log in to the following targets:

Dafna-15G
Dafna-30G
Dafna-20G

The problem is, the storage server (10.35.64.10) doesn't expose/have any of those targets:

kiril_export
Daffi
DB1
Dafna-01
Dafna-02
Dafna-03
Dafna-04
Dafna-Main
MasterBackup
MasterBackup-01
iqn.1994-05.com.redhat:aqua-rhel1_0_922165947831233
iqn.1994-05.com.redhat:blue_0_94153343504339
iqn.2011-02.com.redhat.tlv.lab:yeylon_data-03
mgoldboi_qanashead-1
paikov1
paikov2
paikov3
paikov_tmp
rvaknin_test
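The mismatch above can be checked mechanically. A small Python sketch (the helper name is made up; the target lists are taken from this comment) that diffs the attempted logins against what the portal actually exposes:

```python
def unknown_targets(login_attempts, exposed):
    """Return, sorted, the targets the host tried to log in to that the
    portal does not expose at all."""
    return sorted(set(login_attempts) - set(exposed))

# Targets the failing host tried to log in to (from the HSM log):
attempted = ["Dafna-15G", "Dafna-30G", "Dafna-20G"]

# A few of the targets the storage server (10.35.64.10) actually exposes:
exposed = ["Dafna-01", "Dafna-02", "Dafna-03", "Dafna-04", "Dafna-Main"]

# All three attempted targets are stale - none exist on the portal.
print(unknown_targets(attempted, exposed))
```

In this case the diff is the whole attempted list, which already hints that the host is logging in from stale records rather than from the discovery results.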
After I had to reinstall the host, I can no longer reproduce this. I am closing this for now and will reopen if it reproduces again.
My mistake - reopening since I was able to reproduce
After some more testing, I can see that some of the targets could not even be logged in to on the failed host.

host 1 (succeeded):
[root@blond-vdsf tmp]# iscsiadm -m session
tcp: [154] 10.35.64.10:3260,1 Dafna-01
tcp: [155] 10.35.64.10:3260,1 Dafna-02
tcp: [156] 10.35.64.10:3260,1 Dafna-03
tcp: [157] 10.35.64.10:3260,1 Dafna-04
tcp: [158] 10.35.64.10:3260,1 Dafna-05

host 2 (failed):
[root@blond-vdsg tmp]# iscsiadm -m session
tcp: [7] 10.35.64.10:3260,1 Dafna-01
tcp: [8] 10.35.64.10:3260,1 Dafna-04
tcp: [9] 10.35.64.10:3260,1 Dafna-03

Full logs will be attached again.
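The two session listings above can be diffed with a short Python sketch (not vdsm code; the sample output is copied verbatim from this comment, and the parsing assumes the standard `iscsiadm -m session` line format):

```python
def session_targets(iscsiadm_output):
    """Extract target names from `iscsiadm -m session` output lines like
    'tcp: [154] 10.35.64.10:3260,1 Dafna-01' (target is the last field)."""
    targets = set()
    for line in iscsiadm_output.strip().splitlines():
        parts = line.split()
        if parts and parts[0].startswith("tcp"):
            targets.add(parts[-1])
    return targets

host1 = """\
tcp: [154] 10.35.64.10:3260,1 Dafna-01
tcp: [155] 10.35.64.10:3260,1 Dafna-02
tcp: [156] 10.35.64.10:3260,1 Dafna-03
tcp: [157] 10.35.64.10:3260,1 Dafna-04
tcp: [158] 10.35.64.10:3260,1 Dafna-05
"""
host2 = """\
tcp: [7] 10.35.64.10:3260,1 Dafna-01
tcp: [8] 10.35.64.10:3260,1 Dafna-04
tcp: [9] 10.35.64.10:3260,1 Dafna-03
"""

# Sessions present on host 1 but absent on the failed host.
print(sorted(session_targets(host1) - session_targets(host2)))
```

For these listings the diff is Dafna-02 and Dafna-05, which matches the observation that some targets could not be logged in to on host 2.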
Created attachment 557019 [details] new logs
Haim found the issue: this was an iscsi cache issue. Looking at the file /var/lib/iscsi/nodes/Dafna-04/10.35.64.10,3260,1/default, node.startup was set to automatic instead of manual. I cleaned the cache under /var/lib/iscsi/nodes and the host was able to connect to the storage. Closing this as not a bug.
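For anyone hitting the same thing: the stale records can be located without wiping the whole cache. A hedged Python sketch (the function name is made up; the directory layout mirrors the /var/lib/iscsi/nodes path quoted above) that flags node records whose startup is set to automatic:

```python
import os

def automatic_startup_records(nodes_dir):
    """Walk an open-iscsi node database (normally /var/lib/iscsi/nodes)
    and return the paths of record files containing
    'node.startup = automatic'."""
    flagged = []
    for root, _dirs, files in os.walk(nodes_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path) as f:
                for line in f:
                    key, _, value = line.partition("=")
                    if key.strip() == "node.startup" and value.strip() == "automatic":
                        flagged.append(path)
                        break
    return flagged
```

Instead of deleting the records, the setting can presumably also be flipped in place with iscsiadm's update operation, along the lines of `iscsiadm -m node -T <target> -p <portal> -o update -n node.startup -v manual`.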