Bug 1293666 - getStorageDomainsList successfully reports an empty list if a gluster volume with a storage domain is in Read-Only mode
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.17.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ovirt-3.6.3
Target Release: 4.17.13
Assignee: Nir Soffer
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks:
Reported: 2015-12-22 15:50 UTC by Simone Tiraboschi
Modified: 2016-03-09 21:49 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-12-22 20:58:59 UTC
oVirt Team: Storage
Embargoed:
amureini: ovirt-3.6.z?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments:
hosted-engine-setup and VDSM logs (756.88 KB, application/x-gzip), 2015-12-22 15:57 UTC, Simone Tiraboschi

Description Simone Tiraboschi 2015-12-22 15:50:22 UTC
Description of problem:
getStorageDomainsList successfully reports an empty list if a gluster volume with a storage domain is in Read-Only mode

Previous history: probably due to https://bugzilla.redhat.com/show_bug.cgi?id=1288979, the gluster volume lost its quorum and went into read-only mode.

Now the user tries to deploy the third hosted engine host but hosted-engine-setup reports that there is no previous storage domain on the gluster volume and so it assumes that it's the first host.

Indeed:

2015-12-21 11:29:58 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._getExistingDomain:476 _getExistingDomain
2015-12-21 11:29:58 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._storageServerConnection:638 connectStorageServer
2015-12-21 11:29:58 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._storageServerConnection:701 {'status': {'message': 'OK', 'code': 0}, 'statuslist': [{'status': 0, 'id': '67ece152-dd66-444c-8d18-4249d1b8f488'}]}
2015-12-21 11:29:58 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._getStorageDomainsList:595 getStorageDomainsList
2015-12-21 11:29:59 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._getStorageDomainsList:598 {'status': {'message': 'OK', 'code': 0}, 'domlist': []}

connectStorageServer on the gluster volume completed successfully, and getStorageDomainsList then reported success ('status': {'message': 'OK', 'code': 0}) with no storage domain on the volume ('domlist': []).
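
The ambiguity of that response can be made concrete with a minimal sketch. The helper name classify_domlist_response and its return labels are hypothetical and not part of hosted-engine-setup or VDSM; only the response layout ('status' with 'code', plus 'domlist') is taken from the log above.

```python
def classify_domlist_response(response):
    """Classify a getStorageDomainsList-style response dict.

    Hypothetical helper for illustration only; the response layout is
    copied from the log, everything else is an assumption.
    """
    if response['status']['code'] != 0:
        return 'error'
    if response['domlist']:
        return 'domains-found'
    # code == 0 with an empty list: the caller cannot tell "no domain exists"
    # apart from "a domain exists but could not be read".
    return 'empty-or-unreadable'


# The exact response seen in the hosted-engine-setup log.
observed = {'status': {'message': 'OK', 'code': 0}, 'domlist': []}
print(classify_domlist_response(observed))
```

With only this response to go on, a caller that treats the last case as "no domain exists" will conclude that this is the first host.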

In the VDSM logs we can find:

Thread-141::DEBUG::2015-12-21 11:29:59,666::fileSD::157::Storage.StorageDomainManifest::(__init__) Reading domain in path /rhev/data-center/mnt/glusterSD/localhost:_engine/e89b6e64-bd7d-4846-b970-9af32a3295ee
Thread-141::DEBUG::2015-12-21 11:29:59,666::__init__::320::IOProcessClient::(_run) Starting IOProcess...
Thread-141::DEBUG::2015-12-21 11:29:59,680::persistentDict::192::Storage.PersistentDict::(__init__) Created a persistent dict with FileMetadataRW backend
Thread-141::ERROR::2015-12-21 11:29:59,686::hsm::2898::Storage.HSM::(getStorageDomainsList) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2882, in getStorageDomainsList
    dom = sdCache.produce(sdUUID=sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 100, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 124, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/glusterSD.py", line 32, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 198, in __init__
    validateFileSystemFeatures(manifest.sdUUID, manifest.mountpoint)
  File "/usr/share/vdsm/storage/fileSD.py", line 93, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 350, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 543, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 427, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 30] Read-only file system

However, instead of reporting a failure or an error to hosted-engine-setup, VDSM reported a successful execution in which it found no storage domain there:

Thread-141::INFO::2015-12-21 11:29:59,702::logUtils::51::dispatcher::(wrapper) Run and protect: getStorageDomainsList, Return response: {'domlist': []}
Thread-141::DEBUG::2015-12-21 11:29:59,702::task::1191::Storage.TaskManager.Task::(prepare) Task=`96a9ea03-dc13-483e-9b17-b55a759c9b44`::finished: {'domlist': []}
Thread-141::DEBUG::2015-12-21 11:29:59,702::task::595::Storage.TaskManager.Task::(_updateState) Task=`96a9ea03-dc13-483e-9b17-b55a759c9b44`::moving from state preparing -> state finished
Thread-141::DEBUG::2015-12-21 11:29:59,703::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-141::DEBUG::2015-12-21 11:29:59,703::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-141::DEBUG::2015-12-21 11:29:59,703::task::993::Storage.TaskManager.Task::(_decref) Task=`96a9ea03-dc13-483e-9b17-b55a759c9b44`::ref 0 aborting False
Thread-141::INFO::2015-12-21 11:29:59,704::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:39718 stopped
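
The behaviour seen in these logs (an error logged per domain, followed by a successful empty result) can be reproduced with a short sketch. This is not the VDSM source: produce and list_domains are simplified stand-ins written for illustration, assuming the per-domain exception is caught and logged inside a loop, as the "Unexpected error" line followed by a normal return suggests.

```python
import errno
import logging

log = logging.getLogger("sketch")


def produce(sd_uuid):
    # Stand-in for producing a domain on a read-only mount (hypothetical);
    # it always fails the way the traceback above does.
    raise OSError(errno.EROFS, "Read-only file system")


def list_domains(sd_uuids):
    """Collect the domains that can be produced, skipping any that fail."""
    domlist = []
    for sd_uuid in sd_uuids:
        try:
            produce(sd_uuid)
        except Exception:
            # The failure is logged, but nothing about it reaches the caller.
            log.exception("Unexpected error")
            continue
        domlist.append(sd_uuid)
    return {'domlist': domlist}


result = list_domains(['e89b6e64-bd7d-4846-b970-9af32a3295ee'])
print(result)
```

In this sketch the caller receives an empty domlist for a volume that does contain a domain, which matches the response hosted-engine-setup received.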

 


Version-Release number of selected component (if applicable):
4.17.13

How reproducible:
100%

Steps to Reproduce:
1. create a gluster volume
2. create a storage domain on that
3. make the gluster volume read-only
4. connect the gluster volume with connectStorageServer
5. look for storage domains there

Actual results:
There is an exception in VDSM logs but then getStorageDomainsList responds with {'status': {'message': 'OK', 'code': 0}, 'domlist': []}


Expected results:
a. getStorageDomainsList reports a failed execution (code!=0)
or
b. getStorageDomainsList reports a successful execution (code=0) and includes that storage domain, flagging it as failed
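
The two expected behaviours could look roughly like the following sketch. All names here are hypothetical; the error code 1 and the 'sdUUID', 'failed' and 'error' fields are placeholders rather than real VDSM codes or schema, and option b assumes that domlist entries may become dicts.

```python
import errno


def failing_produce(sd_uuid):
    # Stand-in for producing a domain on a read-only mount (hypothetical).
    raise OSError(errno.EROFS, "Read-only file system")


def list_domains_fail(sd_uuids, produce):
    """Option a: any unreadable domain fails the whole call (code != 0)."""
    domlist = []
    for sd_uuid in sd_uuids:
        try:
            produce(sd_uuid)
        except OSError as e:
            # Code 1 is a placeholder, not a real VDSM error code.
            return {'status': {'message': str(e), 'code': 1}, 'domlist': []}
        domlist.append(sd_uuid)
    return {'status': {'message': 'OK', 'code': 0}, 'domlist': domlist}


def list_domains_flagged(sd_uuids, produce):
    """Option b: the call succeeds, but unreadable domains are flagged."""
    domlist = []
    for sd_uuid in sd_uuids:
        entry = {'sdUUID': sd_uuid, 'failed': False}
        try:
            produce(sd_uuid)
        except OSError as e:
            entry['failed'] = True
            entry['error'] = str(e)
        domlist.append(entry)
    return {'status': {'message': 'OK', 'code': 0}, 'domlist': domlist}


uuids = ['e89b6e64-bd7d-4846-b970-9af32a3295ee']
result_a = list_domains_fail(uuids, failing_produce)
result_b = list_domains_flagged(uuids, failing_produce)
print(result_a['status']['code'], result_b['domlist'][0]['failed'])
```

Either variant would let hosted-engine-setup distinguish an unreadable domain from a volume that has no domain.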

Additional info:

Comment 1 Simone Tiraboschi 2015-12-22 15:57:50 UTC
Created attachment 1108669
hosted-engine-setup and VDSM logs

Comment 2 Nir Soffer 2015-12-22 20:58:59 UTC
(In reply to Simone Tiraboschi from comment #0)
> Now the user tries to deploy the third hosted engine host but
> hosted-engine-setup reports that there is no previous storage domain on the
> gluster volume and so it assumes that it's the first host.

Hosted engine setup should not make such assumptions and should not decide 
if this is a first time install or adding another host. There can be other
issues causing vdsm not to see the hosted engine storage domain.

The user should select whether to add a host or create the first host.
The installer should fail if storage is not in the expected state for the 
user choice.
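
A minimal sketch of that check, assuming the installer asks for an explicit choice and then compares it with what getStorageDomainsList returned. The names validate_choice and UnexpectedStorageState and the choice strings are hypothetical, not part of hosted-engine-setup.

```python
class UnexpectedStorageState(Exception):
    """Raised when storage does not match what the user asked for."""


def validate_choice(choice, domlist):
    """Fail when the storage state contradicts the user's explicit choice.

    choice is 'first-host' or 'additional-host' (hypothetical values);
    domlist is the 'domlist' value of a getStorageDomainsList response.
    """
    if choice not in ('first-host', 'additional-host'):
        raise ValueError("unknown choice: %r" % (choice,))
    if choice == 'additional-host' and not domlist:
        raise UnexpectedStorageState(
            "adding a host, but no storage domain was found")
    if choice == 'first-host' and domlist:
        raise UnexpectedStorageState(
            "first host requested, but a storage domain already exists")


# The scenario from this bug: third host, empty domlist.
try:
    validate_choice('additional-host', [])
except UnexpectedStorageState as e:
    print("setup aborted:", e)
```

Under this scheme the empty list from this bug would stop the deployment of the third host instead of being read as a first-time install.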

I agree that the expected result is nicer, but due to the way vdsm is designed,
we cannot fix this without redesigning the storage domain cache mechanism.

The failure is hidden by the cache mechanism, so getStorageDomainsList does not
see the gluster domain in read-only mode, and cannot return it.

Since this case is a double failure (less useful response after storage failure),
I don't think we should do anything about this.

Closing for now.

Comment 3 Sandro Bonazzola 2015-12-23 09:51:23 UTC
So basically you suggest working around a vdsm bug by moving the decision to the user?

Comment 4 Nir Soffer 2015-12-23 12:12:31 UTC
(In reply to Sandro Bonazzola from comment #3)
> So basically you suggest working around a vdsm bug by moving the decision to
> the user?

No, I'm suggesting that the setup tool should not try to make decisions
for the user. The setup tool can show the state of the system and suggest
the best action, like adding a new host if it finds an existing storage domain.
But this is not in the scope of this bug.

In this bug, vdsm cannot find the domain after a storage server failure.
We don't have a way to report such failures due to the way these failures
are handled in the storage domain cache.

