Description of problem:
getStorageDomainsList successfully reports an empty list if a gluster volume with a storage domain is in read-only mode.

Previous history: probably due to https://bugzilla.redhat.com/show_bug.cgi?id=1288979 , the gluster volume lost its quorum and went into read-only mode.
Now the user tries to deploy the third hosted engine host but hosted-engine-setup reports that there is no previous storage domain on the gluster volume and so it assumes that it's the first host.

Indeed:
2015-12-21 11:29:58 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._getExistingDomain:476 _getExistingDomain
2015-12-21 11:29:58 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._storageServerConnection:638 connectStorageServer
2015-12-21 11:29:58 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._storageServerConnection:701 {'status': {'message': 'OK', 'code': 0}, 'statuslist': [{'status': 0, 'id': '67ece152-dd66-444c-8d18-4249d1b8f488'}]}
2015-12-21 11:29:58 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._getStorageDomainsList:595 getStorageDomainsList
2015-12-21 11:29:59 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._getStorageDomainsList:598 {'status': {'message': 'OK', 'code': 0}, 'domlist': []}

connectStorageServer on the gluster volume completed successfully, and getStorageDomainsList successfully reports ('status': {'message': 'OK', 'code': 0}) that there is no storage domain there ('domlist': []).

In the VDSM logs we can find:
Thread-141::DEBUG::2015-12-21 11:29:59,666::fileSD::157::Storage.StorageDomainManifest::(__init__) Reading domain in path /rhev/data-center/mnt/glusterSD/localhost:_engine/e89b6e64-bd7d-4846-b970-9af32a3295ee
Thread-141::DEBUG::2015-12-21 11:29:59,666::__init__::320::IOProcessClient::(_run) Starting IOProcess...
Thread-141::DEBUG::2015-12-21 11:29:59,680::persistentDict::192::Storage.PersistentDict::(__init__) Created a persistent dict with FileMetadataRW backend
Thread-141::ERROR::2015-12-21 11:29:59,686::hsm::2898::Storage.HSM::(getStorageDomainsList) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2882, in getStorageDomainsList
    dom = sdCache.produce(sdUUID=sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 100, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 124, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/glusterSD.py", line 32, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 198, in __init__
    validateFileSystemFeatures(manifest.sdUUID, manifest.mountpoint)
  File "/usr/share/vdsm/storage/fileSD.py", line 93, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 350, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 543, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 427, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 30] Read-only file system

But then, instead of reporting a failure or an error to hosted-engine-setup, it reported a successful execution in which it was not able to find any storage domain there:
Thread-141::INFO::2015-12-21 11:29:59,702::logUtils::51::dispatcher::(wrapper) Run and protect: getStorageDomainsList, Return response: {'domlist': []}
Thread-141::DEBUG::2015-12-21 11:29:59,702::task::1191::Storage.TaskManager.Task::(prepare) Task=`96a9ea03-dc13-483e-9b17-b55a759c9b44`::finished: {'domlist': []}
Thread-141::DEBUG::2015-12-21 11:29:59,702::task::595::Storage.TaskManager.Task::(_updateState) Task=`96a9ea03-dc13-483e-9b17-b55a759c9b44`::moving from state preparing -> state finished
Thread-141::DEBUG::2015-12-21 11:29:59,703::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-141::DEBUG::2015-12-21 11:29:59,703::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-141::DEBUG::2015-12-21 11:29:59,703::task::993::Storage.TaskManager.Task::(_decref) Task=`96a9ea03-dc13-483e-9b17-b55a759c9b44`::ref 0 aborting False
Thread-141::INFO::2015-12-21 11:29:59,704::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:39718 stopped

Version-Release number of selected component (if applicable):
4.17.13

How reproducible:
100%

Steps to Reproduce:
1. Create a gluster volume
2. Create a storage domain on it
3. Make the gluster volume read-only
4. Connect the gluster volume with connectStorageServer
5. Look for storage domains there

Actual results:
There is an exception in the VDSM logs, but getStorageDomainsList still responds with {'status': {'message': 'OK', 'code': 0}, 'domlist': []}

Expected results:
a. getStorageDomainsList reports a failed execution (code!=0)
or
b. getStorageDomainsList reports a successful execution (code=0) and includes that storage domain, flagging it as failed

Additional info:
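To make the actual and expected results concrete, here is a minimal, self-contained sketch. It is not VDSM code: FakeDomainCache, the three list_domains_* functions and the "failed" key are hypothetical names used only for illustration. The only things taken from the report are the response shape ('status', 'domlist') and the OSError with errno 30.

```python
import errno
import logging

log = logging.getLogger("sketch")

OK = {"message": "OK", "code": 0}


class FakeDomainCache:
    """Hypothetical stand-in for a storage domain cache (not the VDSM one)."""

    def __init__(self, read_only_uuids):
        self._read_only = set(read_only_uuids)

    def produce(self, sd_uuid):
        # Simulate the failure from the traceback: touching a test file
        # on a read-only mount raises OSError with EROFS.
        if sd_uuid in self._read_only:
            raise OSError(errno.EROFS, "Read-only file system")
        return sd_uuid


def list_domains_swallowing(cache, uuids):
    """Actual result: the error is logged and the domain silently skipped."""
    domlist = []
    for sd_uuid in uuids:
        try:
            cache.produce(sd_uuid)
        except Exception:
            log.exception("Unexpected error")
            continue
        domlist.append(sd_uuid)
    return {"status": dict(OK), "domlist": domlist}


def list_domains_failing(cache, uuids):
    """Expected result (a): a non-zero status code on the first failure."""
    domlist = []
    for sd_uuid in uuids:
        try:
            cache.produce(sd_uuid)
        except OSError as e:
            status = {"message": str(e), "code": e.errno}
            return {"status": status, "domlist": []}
        domlist.append(sd_uuid)
    return {"status": dict(OK), "domlist": domlist}


def list_domains_flagging(cache, uuids):
    """Expected result (b): code 0, with unreadable domains flagged.

    The "failed" key is an assumption made for this sketch.
    """
    domlist, failed = [], {}
    for sd_uuid in uuids:
        try:
            cache.produce(sd_uuid)
        except OSError as e:
            failed[sd_uuid] = str(e)
            continue
        domlist.append(sd_uuid)
    return {"status": dict(OK), "domlist": domlist, "failed": failed}
```

With a single domain on a read-only mount, the first function returns an OK status and an empty domlist, which the caller cannot tell apart from "no domain exists"; the other two give the caller something to act on.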
Created attachment 1108669
hosted-engine-setup and VDSM logs
(In reply to Simone Tiraboschi from comment #0)
> Now the user tries to deploy the third hosted engine host but
> hosted-engine-setup reports that there is no previous storage domain on the
> gluster volume and so it assumes that it's the first host.

Hosted engine setup should not make such assumptions and should not decide whether this is a first-time install or the addition of another host. There can be other issues that cause vdsm not to see the hosted engine storage domain.

The user should select whether to add a host or create the first host. The installer should fail if storage is not in the expected state for the user's choice.

I agree that the expected result is nicer, but due to the way vdsm is designed, we cannot fix this without redesigning the storage domain cache mechanism. The failure is hidden by the cache mechanism, so getStorageDomainsList does not see the gluster domain in read-only mode and cannot return it.

Since this case is a double failure (a less useful response after a storage failure), I don't think we should do anything about this.

Closing for now.
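The installer behaviour suggested above could look roughly like the following sketch. It is not hosted-engine-setup code: SetupError, check_storage_for_choice and the choice names "first_host" and "additional_host" are hypothetical, and the response shape is the one shown in comment #0.

```python
class SetupError(Exception):
    """Hypothetical error type raised when setup cannot proceed."""


def check_storage_for_choice(user_choice, response):
    """Fail unless the storage state matches what the user asked for.

    user_choice is "first_host" or "additional_host" (assumed names);
    response is a getStorageDomainsList-style dict.
    """
    status = response["status"]
    if status["code"] != 0:
        raise SetupError("storage query failed: %s" % status["message"])

    has_domain = bool(response["domlist"])
    if user_choice == "additional_host" and not has_domain:
        raise SetupError(
            "no existing storage domain found; cannot add a host")
    if user_choice == "first_host" and has_domain:
        raise SetupError(
            "a storage domain already exists; refusing a first-host setup")
    if user_choice not in ("first_host", "additional_host"):
        raise SetupError("unknown choice: %r" % (user_choice,))
```

With the response from this bug (code 0, empty domlist), a user who chose to add a host would get an error instead of a silent first-host deployment.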
So basically you suggest working around a vdsm bug by moving the decision to the user?
(In reply to Sandro Bonazzola from comment #3)
> So basically you suggest working around a vdsm bug by moving the decision to
> the user?

No, I'm suggesting that the setup tool should not try to make decisions for the user. The setup tool can show the state of the system and suggest the best action, such as adding a new host if it finds an existing storage domain.

But this is not in the scope of this bug. In this bug, vdsm cannot find the domain after a storage server failure. We don't have a way to report such failures because of the way they are handled in the storage domain cache.