Bug 882907 - vdsm: hsm becomes non-operational when connectStoragePool fails because it cannot read metadata (posix storage)
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
3.2.0
x86_64 Linux
Priority: high, Severity: urgent
: ---
: 3.2.0
Assigned To: Federico Simoncelli
Elad
storage
: ZStream
Depends On: 890842
Blocks: 887899
Reported: 2012-12-03 05:27 EST by Dafna Ron
Modified: 2016-02-10 15:24 EST (History)
10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:


Attachments
logs (749.58 KB, application/x-gzip)
2012-12-03 05:27 EST, Dafna Ron

Description Dafna Ron 2012-12-03 05:27:32 EST
Created attachment 656530 [details]
logs

Description of problem:

I attached/activated an ISO domain and then detached it while the HSM host was in maintenance.
After activating the HSM host again, it becomes non-operational because it cannot read metadata.

Version-Release number of selected component (if applicable):

vdsm-4.9.6-44.0.el6_3.x86_64

How reproducible:

90%

Steps to Reproduce:
1. In a two-host cluster, add 1 POSIX data domain (I used NFS over Gluster storage)
2. Attach and activate an ISO domain
3. Put the HSM host in maintenance
4. Detach the ISO domain
5. Activate the HSM host
  
Actual results:

The HSM host becomes non-operational with an error that it cannot read metadata.

Expected results:

The host should read the metadata and activate.

Additional info: logs attached

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 862, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 904, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 648, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1524, in getMasterDomain
    if not domain.isMaster():
  File "/usr/share/vdsm/storage/sd.py", line 734, in isMaster
    return self.getMetaParam(DMDK_ROLE).capitalize() == MASTER_DOMAIN
  File "/usr/share/vdsm/storage/sd.py", line 694, in getMetaParam
    return self._metadata[key]
  File "/usr/share/vdsm/storage/persistentDict.py", line 85, in __getitem__
    return dec(self._dict[key])
  File "/usr/share/vdsm/storage/persistentDict.py", line 195, in __getitem__
    raise KeyError(key)
KeyError: 'ROLE'

Thread-2654::ERROR::2012-12-03 10:05:34,836::dispatcher::69::Storage.Dispatcher.Protect::(run) 'ROLE'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1164, in prepare
    raise self.error
KeyError: 'ROLE'
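The KeyError originates at the bottom of the chain: sd.getMetaParam() reads self._metadata[key], and the persistentDict wrapper raises KeyError when the key is absent rather than falling back to a default. A minimal sketch of that chain (hypothetical, simplified classes; the real vdsm code adds locking, encoding, and validation) shows how a cached domain object with incompletely read metadata fails isMaster():

```python
# Simplified stand-ins for vdsm's storage.sd / storage.persistentDict
# behaviour seen in the traceback (hypothetical sketch, not vdsm code).
DMDK_ROLE = 'ROLE'
MASTER_DOMAIN = 'Master'


class PersistentDict:
    """Raises KeyError for missing keys, like persistentDict.__getitem__."""
    def __init__(self, data):
        self._dict = dict(data)

    def __getitem__(self, key):
        if key not in self._dict:
            raise KeyError(key)  # same failure mode as the traceback
        return self._dict[key]


class Domain:
    def __init__(self, metadata):
        self._metadata = PersistentDict(metadata)

    def getMetaParam(self, key):
        return self._metadata[key]

    def isMaster(self):
        return self.getMetaParam(DMDK_ROLE).capitalize() == MASTER_DOMAIN


# A stale/cached domain whose metadata was read while the pool was being
# reshuffled may be missing ROLE entirely, so isMaster() blows up:
stale = Domain({})
try:
    stale.isMaster()
except KeyError as e:
    print('KeyError:', e)

# A domain with complete metadata works as expected:
healthy = Domain({'ROLE': 'Master'})
print(healthy.isMaster())
```

This is consistent with comment 4's diagnosis of a domain cache issue: the domain object itself exists in the cache, but its metadata snapshot lacks the ROLE key, so connectStoragePool fails instead of refreshing the metadata.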


[root@camel-vdsb ~]# vdsClient -s 0 getStorageDomainsList
e62df18b-ef25-4a1b-ae86-bee465af8087

[root@camel-vdsb ~]# vdsClient -s 0 getStorageDomainInfo e62df18b-ef25-4a1b-ae86-bee465af8087
	uuid = e62df18b-ef25-4a1b-ae86-bee465af8087
	pool = ['2d5f297c-185a-470b-8600-208fc3c9b235']
	lver = -1
	version = 3
	role = Master
	remotePath = filer01.qa.lab.tlv.redhat.com:/hateya-posix
	spm_id = -1
	type = POSIXFS
	class = Data
	master_ver = 0
	name = hateya-posix-2-bricks
Comment 1 Haim 2012-12-03 11:56:54 EST
This seems more frequent on POSIX; please give it priority.
Comment 2 Daniel Paikov 2012-12-05 09:57:41 EST
Could no longer reproduce after applying fsimonce's patch: http://gerrit.ovirt.org/#/c/9422/
Comment 3 Daniel Paikov 2012-12-05 10:27:12 EST
Federico: If your patch does indeed fix this bug, please move this BZ to POST.
Comment 4 Federico Simoncelli 2012-12-10 04:46:41 EST
Looking at the logs, the exception happens in connectStoragePool, which is consistent with bug 879253 and with a domain cache issue.

Based on comment 2 we could move this to CLOSED CURRENTRELEASE. Anyway, to be extra sure, I'm moving it to ON_QA to have it tested again.
Comment 6 Dafna Ron 2012-12-30 10:49:03 EST
Cannot verify on SF2 - blocked by:
https://bugzilla.redhat.com/show_bug.cgi?id=890842
Comment 8 Elad 2013-03-17 08:30:55 EDT
Verified on SF10; reproduced 3 times.
The HSM host comes up after activating it from maintenance and after ISO domain detachment.
Comment 9 Itamar Heim 2013-06-11 05:10:37 EDT
3.2 has been released
Comment 10 Itamar Heim 2013-06-11 05:10:53 EDT
3.2 has been released
Comment 11 Itamar Heim 2013-06-11 05:38:13 EDT
3.2 has been released
