Bug 882907
| Field | Value |
|---|---|
| Summary | vdsm: hsm becomes non-operational when connectStoragePool fails because it cannot read metadata (posix storage) |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | vdsm |
| Version | 3.2.0 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | urgent |
| Priority | high |
| Target Milestone | --- |
| Target Release | 3.2.0 |
| Reporter | Dafna Ron <dron> |
| Assignee | Federico Simoncelli <fsimonce> |
| QA Contact | Elad <ebenahar> |
| CC | abaron, amureini, bazulay, cpelland, dpaikov, hateya, iheim, lpeer, scohen, ykaul |
| Keywords | ZStream |
| Whiteboard | storage |
| oVirt Team | Storage |
| Doc Type | Bug Fix |
| Type | Bug |
| Bug Depends On | 890842 |
| Bug Blocks | 887899 |
| Attachments | logs (attachment 656530) |
Comments:

This seems more frequent on POSIX; please give it priority.

Could no longer reproduce after applying fsimonce's patch: http://gerrit.ovirt.org/#/c/9422/

Federico: If your patch does indeed fix this bug, please move this BZ to POST.

Looking at the logs, the exception happens in connectStoragePool, which is consistent with bug 879253 and with a domain cache issue.

Based on comment 2 we could move this to CLOSED CURRENTRELEASE; anyway, to be extra sure, I'm moving this to ON_QA to have it tested again.

Cannot verify on SF2 - blocked by: https://bugzilla.redhat.com/show_bug.cgi?id=890842

Verified on SF10. Reproduced 3 times; the HSM host comes up after activating it from maintenance and after ISO domain detachment.

3.2 has been released.
Created attachment 656530 [details]: logs

Description of problem:
I attached and activated an ISO domain, then detached the ISO domain while the HSM host was in maintenance. After activating the HSM host again, it becomes non-operational because it cannot read metadata.

Version-Release number of selected component (if applicable):
vdsm-4.9.6-44.0.el6_3.x86_64

How reproducible:
90%

Steps to Reproduce:
1. In a two-host cluster, add one POSIX data domain (I used NFS over Gluster storage).
2. Attach and activate an ISO domain.
3. Put the HSM host in maintenance.
4. Detach the ISO domain.
5. Activate the HSM host.

Actual results:
The HSM host becomes non-operational with an error that it cannot read metadata.

Expected results:
The host should read the metadata and activate.

Additional info: logs attached.

```
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 862, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 904, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 648, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1524, in getMasterDomain
    if not domain.isMaster():
  File "/usr/share/vdsm/storage/sd.py", line 734, in isMaster
    return self.getMetaParam(DMDK_ROLE).capitalize() == MASTER_DOMAIN
  File "/usr/share/vdsm/storage/sd.py", line 694, in getMetaParam
    return self._metadata[key]
  File "/usr/share/vdsm/storage/persistentDict.py", line 85, in __getitem__
    return dec(self._dict[key])
  File "/usr/share/vdsm/storage/persistentDict.py", line 195, in __getitem__
    raise KeyError(key)
KeyError: 'ROLE'

Thread-2654::ERROR::2012-12-03 10:05:34,836::dispatcher::69::Storage.Dispatcher.Protect::(run) 'ROLE'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1164, in prepare
    raise self.error
KeyError: 'ROLE'
```

```
[root@camel-vdsb ~]# vdsClient -s 0 getStorageDomainsList
e62df18b-ef25-4a1b-ae86-bee465af8087

[root@camel-vdsb ~]# vdsClient -s 0 getStorageDomainInfo e62df18b-ef25-4a1b-ae86-bee465af8087
    uuid = e62df18b-ef25-4a1b-ae86-bee465af8087
    pool = ['2d5f297c-185a-470b-8600-208fc3c9b235']
    lver = -1
    version = 3
    role = Master
    remotePath = filer01.qa.lab.tlv.redhat.com:/hateya-posix
    spm_id = -1
    type = POSIXFS
    class = Data
    master_ver = 0
    name = hateya-posix-2-bricks
```
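For readers unfamiliar with the failure path in the traceback: persistentDict raises KeyError when a metadata key is absent, so a stale or partially populated domain-metadata cache (the theory raised in the comments) surfaces as `KeyError: 'ROLE'` the moment isMaster() is evaluated during connectStoragePool. Below is a minimal, self-contained sketch of that mechanism; only the names DMDK_ROLE, MASTER_DOMAIN, isMaster, and getMetaParam are taken from the traceback above — the classes themselves are simplified stand-ins, not vdsm's actual implementation.

```python
# Illustrative stand-in for vdsm's metadata lookup. The constants and
# method names mirror the traceback; the rest is a hypothetical sketch.

DMDK_ROLE = 'ROLE'
MASTER_DOMAIN = 'Master'


class MetadataDict(object):
    """Simplified dict-backed metadata, standing in for persistentDict."""

    def __init__(self, backing):
        self._dict = dict(backing)

    def __getitem__(self, key):
        # persistentDict re-raises KeyError for a missing key; this is
        # the "KeyError: 'ROLE'" seen in the bug's traceback.
        if key not in self._dict:
            raise KeyError(key)
        return self._dict[key]


class Domain(object):
    def __init__(self, metadata):
        self._metadata = MetadataDict(metadata)

    def getMetaParam(self, key):
        return self._metadata[key]

    def isMaster(self):
        # Mirrors sd.isMaster(): compares the ROLE metadata parameter.
        return self.getMetaParam(DMDK_ROLE).capitalize() == MASTER_DOMAIN


# Metadata read correctly from storage includes ROLE:
print(Domain({'ROLE': 'Master'}).isMaster())    # True

# A stale/partial cache entry missing ROLE reproduces the crash, which
# connectStoragePool did not handle, leaving the host non-operational:
try:
    Domain({}).isMaster()
except KeyError as exc:
    print('metadata unreadable:', exc)          # 'ROLE'
```

Note that getStorageDomainInfo in the console output above does report role = Master, which is consistent with the domain-cache theory in the comments: the on-disk metadata was readable while the cached copy used by connectStoragePool was not.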