Created attachment 738261 [details]
logs

Description of problem:
vdsm reports the domain as partially inaccessible if we put the domain in maintenance during upgrade.

Version-Release number of selected component (if applicable):
sf13.1

How reproducible:
100%

Steps to Reproduce:
1. start upgrade of pool from v1 (3.0) to v3 (3.2)
2. put master domain in maintenance
3.

Actual results:
we report that the domain is partially inaccessible

Expected results:
we should complete the upgrade before marking the domain in maintenance

Additional info:
logs

I think it might be related to RefreshStoragePool, which is sent during upgrade:

2013-04-21 16:54:11,446 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-50) [3df2feeb] Command RefreshStoragePoolVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to RefreshStoragePoolVDS, error = Unknown pool id, pool not connected: ('7f13928f-5e07-4971-a7ac-44ca747c4467',)
2013-04-21 16:54:11,446 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RefreshStoragePoolVDSCommand] (pool-4-thread-50) [3df2feeb] FINISH, RefreshStoragePoolVDSCommand, log id: 744980b1
2013-04-21 16:54:11,446 ERROR [org.ovirt.engine.core.bll.storage.AfterDeactivateSingleAsyncOperation] (pool-4-thread-50) [3df2feeb] Failed to refresh storagePool. Host cougar01 to storage pool iSCSI.
Exception: {3}: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to RefreshStoragePoolVDS, error = Unknown pool id, pool not connected: ('7f13928f-5e07-4971-a7ac-44ca747c4467',)
	at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:167) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.storage.AfterDeactivateSingleAsyncOperation.execute(AfterDeactivateSingleAsyncOperation.java:32) [engine-bll.jar:]
	at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:42) [engine-utils.jar:]
	at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:31) [engine-utils.jar:]
	at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalCallable.call(ThreadPoolUtil.java:99) [engine-utils.jar:]
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [rt.jar:1.7.0_09-icedtea]
	at java.util.concurrent.FutureTask.run(FutureTask.java:166) [rt.jar:1.7.0_09-icedtea]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_09-icedtea]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_09-icedtea]
	at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_09-icedtea]
2013-04-21 16:54:11,448 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RefreshStoragePoolVDSCommand] (pool-4-thread-40) [3df2feeb] FINISH, RefreshStoragePoolVDSCommand, log id: 3ee32507
2013-04-21 16:54:11,487 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (pool-4-thread-43) [3df2feeb] START, DisconnectStorageServerVDSCommand(HostName = cougar02, HostId = 8d6fddbc-dde5-474c-a436-f26936a4deb0, storagePoolId = 7f13928f-5e07-4971-a7ac-44ca747c4467, storageType = ISCSI, connectionList = [{ id: 7a709e03-1538-434c-9a45-7b747e5184ea, connection: 10.35.64.10, iqn: Dafna-03, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 72b717c1
Fede, please take a look at what's going on here.
This issue is not related to the domain being moved to maintenance, even though that is probably a valid reproducer.

Basically, if we send a refreshStoragePool to the HSMs right after the upgrade has completed, the metadata switch (lv-based vs tag-based) in the domain cache hasn't happened yet, and the master domain fails the validation (empty metadata).

    @public
    def refreshStoragePool(self, spUUID, msdUUID, masterVersion, options=None):
        ...
        pool = self.getPool(spUUID)
        try:
            self.validateSdUUID(msdUUID)
            pool.refresh(msdUUID, masterVersion)
        except:
            ...

The problem here is that before the refresh (pool.refresh), which would trigger the metadata switch, there's a validateSdUUID call for the master domain which runs against the cached domain.

Traceback found on the HSM:

Thread-77::ERROR::2013-04-21 16:54:11,467::task::850::TaskManager.Task::(_setError) Task=`b6e59a23-2acd-4856-9082-86b47aa09cb9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 807, in refreshStoragePool
    self.validateSdUUID(msdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 254, in validateSdUUID
    sdDom.validate()
  File "/usr/share/vdsm/storage/blockSD.py", line 816, in validate
    raise se.StorageDomainAccessError(self.sdUUID)
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('dd713883-398a-45db-ae2a-ec6de8751874',)

Where:

    def validate(self):
        ...
        if not len(self.getMetadata()):
            raise se.StorageDomainAccessError(self.sdUUID)

The metadata is empty (len == 0) because it has just been moved to the vg tags.
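The failure mode above can be reproduced in isolation. A minimal sketch (the class and names below are stand-ins, not vdsm code): after the v1 -> v3 upgrade moves the metadata from the metadata LV to VG tags, the HSM's stale cached domain object sees an empty mapping, so the same emptiness check that blockSD.validate() performs raises before the refresh can rebuild the cache.

```python
class StorageDomainAccessError(Exception):
    """Stand-in for se.StorageDomainAccessError."""
    pass


class CachedBlockDomain:
    """Hypothetical stand-in for the cached lv-based domain object."""

    def __init__(self, sdUUID, metadata):
        self.sdUUID = sdUUID
        self._metadata = metadata

    def getMetadata(self):
        return self._metadata

    def validate(self):
        # Mirrors the check quoted from blockSD.py: empty metadata is
        # treated as an inaccessible domain.
        if not len(self.getMetadata()):
            raise StorageDomainAccessError(self.sdUUID)


# Before the upgrade, the cached object still sees the lv-based metadata.
dom = CachedBlockDomain("dd713883-398a-45db-ae2a-ec6de8751874", {"VERSION": 0})
dom.validate()  # passes

# The upgrade moves the metadata to VG tags; the stale lv-based reader
# now sees nothing, so validate() raises before refresh can fix the cache.
dom._metadata = {}
failed = False
try:
    dom.validate()
except StorageDomainAccessError:
    failed = True
assert failed
```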
Looking at the SPM logs, we have proof that the domain had just been upgraded:

Thread-91::DEBUG::2013-04-21 16:53:41,053::formatConverter::336::Storage.v3DomainConverter::(v3DomainConverter) Switching the cluster lock for domain dd713883-398a-45db-ae2a-ec6de8751874

The solution for this will probably involve removing (or somehow improving) the preliminary validateSdUUID call.
The failed refresh will cause the host to disconnect from the pool, and host recovery will then reconnect it; so although the problem is real, the impact is not that great.
Fede, does the flow change with getting rid of pool metadata changes?
(In reply to Ayal Baron from comment #5)
> Fede, does the flow change with getting rid of pool metadata changes?

This may not fail when using the new memory based pool backend, but since we must support the old backend, we must fix this.
(In reply to Nir Soffer from comment #6)
> (In reply to Ayal Baron from comment #5)
> > Fede, does the flow change with getting rid of pool metadata changes?
>
> This may not fail when using the new memory based pool backend, but since we
> must support the old backend, we must fix this.

Actually, considering the impact is not very big (this sorts itself out automatically after a while), if this will not be relevant with the memory based pool backend then it will not really be worth the trouble. Fede?
(In reply to Ayal Baron from comment #7)
> (In reply to Nir Soffer from comment #6)
> > (In reply to Ayal Baron from comment #5)
> > > Fede, does the flow change with getting rid of pool metadata changes?
> >
> > This may not fail when using the new memory based pool backend, but since we
> > must support the old backend, we must fix this.
>
> actually considering the impact is not very big (this sorts itself out
> automatically after a while), if this will not be relevant with the memory
> based pool backend then it will not really be worth the trouble.
> Fede?

I agree that the impact is not big, but the fix should be easy.

If I understand correctly, the problem is that we try to validate the master domain (on the HSM) before refreshing, and the check fails because the metadata was moved somewhere else.

Obviously, when a refreshStoragePool is sent, we cannot try to validate anything before satisfying the refresh request.

I think that the fix may be as easy as moving the validation after the refresh (or even removing it completely).
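The proposed reordering can be sketched as follows. This is a self-contained illustration, not the actual vdsm patch: the cache/refresh objects and function names below are invented, with refresh() simulating the domain-cache rebuild that picks up the tag-based metadata written by the upgrade.

```python
class StorageDomainAccessError(Exception):
    """Stand-in for se.StorageDomainAccessError."""
    pass


class FakeDomainCache:
    """Hypothetical domain cache: starts out stale right after the upgrade."""

    def __init__(self):
        self.metadata = {}  # stale lv-based reader sees nothing

    def refresh(self):
        # Simulates the metadata switch: re-read from the VG tags.
        self.metadata = {"VERSION": 3}

    def validate(self):
        if not self.metadata:
            raise StorageDomainAccessError("master domain")


def refresh_pool_current(cache):
    # Current order: validate against the stale cache, then refresh.
    cache.validate()
    cache.refresh()


def refresh_pool_proposed(cache):
    # Proposed order: rebuild the cache first, then (optionally) validate.
    cache.refresh()
    cache.validate()


# The current order fails on the stale cache.
current_ok = True
try:
    refresh_pool_current(FakeDomainCache())
except StorageDomainAccessError:
    current_ok = False

# The proposed order completes without raising.
refresh_pool_proposed(FakeDomainCache())

assert not current_ok
```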
(In reply to Federico Simoncelli from comment #8) > (In reply to Ayal Baron from comment #7) > > (In reply to Nir Soffer from comment #6) > > > (In reply to Ayal Baron from comment #5) > > > > Fede, does the flow change with getting rid of pool metadata changes? > > > > > > This may not fail when using the new memory based pool backend, but since we > > > must support the old backend, we must fix this. > > > > actually considering the impact is not very big (this sorts itself out > > automatically after a while), if this will not be relevant with the memory > > based pool backend then it will not really be worth the trouble. > > Fede? > > I agree that the impact is not big but the fix should be easy. > > If understand correctly the problem is that we try to validate the master > domain (on HSM) before refreshing and the check fails because the metadata > was moved somewhere else. > > Obviously when a refreshStoragePool is sent we cannot try to validate > anything before satisfying the refresh request. > > I think that the fix may be as easy as moving the validation after the > refresh (or even remove it completely). If this is indeed the case then by all means, let's get a patch.