Created attachment 704880 [details]
logs

Description of problem:
When trying to connect a host to a storage pool whose master domain is inactive, the operation fails because the engine does not perform connectStorageServer on an inactive domain. Instead, the engine sends connectStorageServer to another domain that is unattached and then connectStoragePool to the relevant inactive domain.

Version-Release number of selected component (if applicable):
vdsm-4.10.2-10.0.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Have one host and 2 domains: 1 inactive and 1 unattached.
2. Activate the host.

Actual results:
The engine performs connectStorageServer to another domain that is unattached and then connectStoragePool to the relevant inactive domain.

Expected results:
The engine should perform connectStorageServer on the inactive domain.

Additional info:
logs
The reproduction is:
1. In an iSCSI DC with 1 data domain, 1 ISO domain and 1 host, block the storage domain from the host using iptables.
2. Once the host becomes non-operational, add a second host.

Both hosts will be non-operational, the data domain will be inactive and the ISO domain will be unknown.

3. Remove the iptables block from the storage.
4. Put the hosts in maintenance and activate them.

We send connectStoragePool without connectStorageServer for the data domain, but we do send connectStorageServer for the ISO domain.
(In reply to comment #2)
> the reproduction is:
> 1. in an iscsi DC with 1 data domain, 1 iso domain and 1 host, block the
> storage domain from the host using iptables
> 2. once host becomes non-operational add a second host
>
> both hosts will be non-operational, the data domain will be inactive and the
> iso will be unknown.
>
> 3. remove the iptables block from the storage.
>
> 4. put hosts in maintenance and activate the hosts
>
> we send connectStoragePool without connectStorageServer for the data domain
> but we do send connectStorageServer for the iso domain.

Liron, is this use case covered by the bug you're fixing? (If so, please close as duplicate.)
Ayal, this seems to be a different issue than the other bug (though they are somewhat related).

From a brief look, the issue here seems to be that during initVdsOnUp we perform ConnectStorageServer only for domains that are unknown/active, queried using the stored procedure Getstorage_server_connectionsByStoragePoolId (storage_domains.status in(0,3)).

The domain auto recovery process performs the connect operations only for hosts that are in status up, so as the hosts should be non-operational, we won't connect them through this flow either.

So IIUC, initVdsOnUp can't succeed because we don't connect to the inactive domain's storage server, while auto recovery won't help because the hosts aren't in status up.

I'm adding Michael for his opinion.
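To illustrate the filtering described above, here is a minimal Python sketch of a status-based connection query. It is not the engine's code: the `Domain` class, the `connections_for_pool` helper and the connection strings are hypothetical, and the numeric code used for the inactive status is an assumption (the comment only states that statuses 0 and 3 are kept).

```python
from dataclasses import dataclass

# Assumed status codes. The comment only says the query keeps statuses 0 and 3
# (unknown/active); which code is which, and INACTIVE = 4, are assumptions.
UNKNOWN, ACTIVE, INACTIVE = 0, 3, 4


@dataclass(frozen=True)
class Domain:
    """Hypothetical stand-in for a storage domain row."""
    name: str
    status: int
    connection: str


def connections_for_pool(domains, allowed_statuses):
    """Return the connections of domains whose status is in allowed_statuses."""
    return [d.connection for d in domains if d.status in allowed_statuses]


# Scenario from the reproduction: inactive data domain, unknown ISO domain.
domains = [
    Domain("data", INACTIVE, "iscsi-target"),
    Domain("iso", UNKNOWN, "nfs-export"),
]

# Filter as described (status in (0, 3)): the data domain's connection is skipped.
current = connections_for_pool(domains, {UNKNOWN, ACTIVE})

# Filter that also includes inactive domains.
proposed = connections_for_pool(domains, {UNKNOWN, ACTIVE, INACTIVE})

print(current)
print(proposed)
```

With the first filter only the ISO domain's connection is returned, which matches the observed behaviour where connectStorageServer is sent for the ISO domain but not for the data domain.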
(In reply to comment #4)
> Ayal, this is seems to be different issue then the other bug (though they
> somehow related)
>
> From brief look - The issue here seems to me that during initVdsOnUp, we
> perform ConnectStorageServer only to domains that are unknown/active -
> queried using the stored procedure
> Getstorage_server_connectionsByStoragePoolId (storage_domains.status
> in(0,3));

This seems wrong to me. Once we split maintenance from inactive, I don't understand why initvdsonup doesn't try to connect to inactive domains (this is exactly the operation that may move them back to active).

>
> The domain auto recovery process performs the connect operations only for
> hosts that are in status up, so basically as the hosts should be non
> operational we won't do connect operation either by this flow to them.
>
> so basically IIUC, we can't have success in initvdsonup as we don't connect
> to the inactive domain storage server, while auto recovery won't help as the
> hosts aren't in status up.
>
> I'm adding Michael for his opinion.
I think this bug has existed since at least version 3.1, when the InActive status was introduced.

First bug: it is in our code. The method DbFacade.getInstance().getStorageServerConnectionDao().getAllForStoragePool() does not return all connections, only those of active/unknown domains. What is funnier, the code has two tests that check that it returns all domains; nothing to add. The fix for this is easy.

Next problem: the connection succeeds, but I need to do a reconstruct because of an error during ConnectStoragePool. The reconstruct will fail because the only storage domain (or all of the storage domains) are inactive. (Actually, the reconstruct will not run at all, because no new master domain is elected.) If reconstruct is called from InitVdsOnUp, a domain with any status (Active/Unknown/InActive) can be chosen as the new master, but Active should have first priority.

So, two fixes.
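The master election rule proposed above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: `elect_master` and the `(name, status)` tuples are hypothetical, and the relative order of Unknown and InActive is an assumption, since the comment only says that Active should come first.

```python
# Lower rank wins. Only "Active first" comes from the comment; ranking Unknown
# ahead of InActive is an assumption made for this sketch.
PRIORITY = {"Active": 0, "Unknown": 1, "InActive": 2}


def elect_master(domains):
    """Pick a new master from (name, status) pairs, or None if none is eligible.

    Domains whose status is not Active/Unknown/InActive are not candidates.
    """
    candidates = [(name, status) for name, status in domains if status in PRIORITY]
    if not candidates:
        return None
    # min() keeps the first candidate among equals, so input order breaks ties.
    best_name, _ = min(candidates, key=lambda d: PRIORITY[d[1]])
    return best_name


print(elect_master([("data1", "InActive"), ("data2", "Active")]))
print(elect_master([("data1", "InActive")]))
print(elect_master([("data1", "Maintenance")]))
```

Under this rule a pool whose only data domain is inactive still yields a candidate, so the reconstruct would at least be attempted instead of being skipped for lack of an elected master.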
http://gerrit.ovirt.org/#/c/12805/

This fixes the connection bug. If there is a problem after the connect, a new bug should be opened.
Verified on SF11.
connectStorageServer was sent to the inactive data domain and then connectStoragePool.

Thread-38639::INFO::2013-03-25 11:14:27,392::logUtils::37::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=3, spUUID='f8ed642d-5e2f-4065-b22f-8a3f4d88e318', conList=[{'connection': '10.35.64.81', 'iqn': 'elad203', 'portal': '1', 'user': '', 'password': '******', 'id': 'ad6dccb6-365f-465c-bcb6-25a444169528', 'port': '3260'}], options=None)

Thread-38641::INFO::2013-03-25 11:14:28,169::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='f8ed642d-5e2f-4065-b22f-8a3f4d88e318', hostID=2, scsiKey='f8ed642d-5e2f-4065-b22f-8a3f4d88e318', msdUUID='29daeb89-f858-4d2a-ba38-3027210862a8', masterVersion=1, options=None)
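For repeating this verification on other logs, the call order can be checked with a small Python sketch like the one below. The helper names are hypothetical, the regular expression assumes the "Run and protect: <verb>(" layout seen in the two lines above, and the sample lines are those same lines with their arguments abbreviated.

```python
import re

# Matches the verb that follows "Run and protect: " in a vdsm log line.
CALL_RE = re.compile(r"Run and protect: (\w+)\(")


def call_sequence(log_lines):
    """Return the verbs logged by 'Run and protect' in the order they appear."""
    calls = []
    for line in log_lines:
        match = CALL_RE.search(line)
        if match:
            calls.append(match.group(1))
    return calls


def server_connected_before_pool(log_lines):
    """True if a connectStorageServer precedes the first connectStoragePool."""
    calls = call_sequence(log_lines)
    if "connectStoragePool" not in calls:
        return False
    pool_index = calls.index("connectStoragePool")
    return "connectStorageServer" in calls[:pool_index]


# The two verified log lines, with arguments abbreviated for readability.
sample = [
    "Thread-38639::INFO::2013-03-25 11:14:27,392::logUtils::37::dispatcher::(wrapper) "
    "Run and protect: connectStorageServer(domType=3, ...)",
    "Thread-38641::INFO::2013-03-25 11:14:28,169::logUtils::37::dispatcher::(wrapper) "
    "Run and protect: connectStoragePool(spUUID='f8ed642d-5e2f-4065-b22f-8a3f4d88e318', ...)",
]

print(call_sequence(sample))
print(server_connected_before_pool(sample))
```

This check only looks at ordering; it does not confirm that the connectStorageServer call targets the same domain that connectStoragePool uses as master.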
3.2 has been released