Bug 882907 - vdsm: hsm becomes non-operational when connectStoragePool fails because it cannot read metadata (posix storage)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 3.2.0
Assignee: Federico Simoncelli
QA Contact: Elad
URL:
Whiteboard: storage
Depends On: 890842
Blocks: 887899
 
Reported: 2012-12-03 10:27 UTC by Dafna Ron
Modified: 2016-02-10 20:24 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs (749.58 KB, application/x-gzip)
2012-12-03 10:27 UTC, Dafna Ron

Description Dafna Ron 2012-12-03 10:27:32 UTC
Created attachment 656530 [details]
logs

Description of problem:

I attached and activated an ISO domain, then detached the ISO domain while the HSM host was in maintenance.
After activating the HSM host again, it becomes non-operational because it cannot read the domain metadata.

Version-Release number of selected component (if applicable):

vdsm-4.9.6-44.0.el6_3.x86_64

How reproducible:

90%

Steps to Reproduce:
1. In a two-host cluster, add one POSIX data domain (I used NFS over Gluster storage)
2. Attach and activate an ISO domain
3. Put the HSM host in maintenance
4. Detach the ISO domain
5. Activate the HSM host
  
Actual results:

The HSM host becomes non-operational with an error that it cannot read metadata.

Expected results:

The host should read the metadata and become active.

Additional info:logs attached

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 862, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 904, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 648, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1524, in getMasterDomain
    if not domain.isMaster():
  File "/usr/share/vdsm/storage/sd.py", line 734, in isMaster
    return self.getMetaParam(DMDK_ROLE).capitalize() == MASTER_DOMAIN
  File "/usr/share/vdsm/storage/sd.py", line 694, in getMetaParam
    return self._metadata[key]
  File "/usr/share/vdsm/storage/persistentDict.py", line 85, in __getitem__
    return dec(self._dict[key])
  File "/usr/share/vdsm/storage/persistentDict.py", line 195, in __getitem__
    raise KeyError(key)
KeyError: 'ROLE'

Thread-2654::ERROR::2012-12-03 10:05:34,836::dispatcher::69::Storage.Dispatcher.Protect::(run) 'ROLE'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1164, in prepare
    raise self.error
KeyError: 'ROLE'
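The tracebacks show that sd.getMetaParam raises KeyError('ROLE') from a plain dictionary lookup, which is what happens when the in-memory copy of the domain metadata has gone stale (here, across the detach while the host was in maintenance). The following is a minimal sketch with hypothetical names, not the real vdsm classes, illustrating the failure mode and why rereading the metadata from storage makes the same check succeed:

```python
# Minimal sketch (hypothetical class; not the real vdsm code) of how a
# stale cached metadata dict raises KeyError('ROLE') as in the traceback.

DMDK_ROLE = "ROLE"
MASTER_DOMAIN = "Master"

class StorageDomainSketch:
    def __init__(self, on_disk_metadata):
        # Simulates the metadata as actually stored on the domain.
        self._on_disk = on_disk_metadata
        # Simulates the in-memory cache; it went stale (empty) while the
        # domain was detached, so 'ROLE' is missing.
        self._metadata = {}

    def getMetaParam(self, key):
        # Mirrors sd.getMetaParam: a plain lookup, so a stale cache raises.
        return self._metadata[key]

    def isMaster(self):
        return self.getMetaParam(DMDK_ROLE).capitalize() == MASTER_DOMAIN

    def refresh(self):
        # Rereading metadata from storage repopulates the cache.
        self._metadata = dict(self._on_disk)

domain = StorageDomainSketch({"ROLE": "Master", "VERSION": "3"})
try:
    domain.isMaster()          # stale cache -> KeyError, as in the logs
except KeyError as e:
    assert str(e) == "'ROLE'"

domain.refresh()               # after rereading from storage, the check passes
assert domain.isMaster()
```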


[root@camel-vdsb ~]# vdsClient -s 0 getStorageDomainsList
e62df18b-ef25-4a1b-ae86-bee465af8087

[root@camel-vdsb ~]# vdsClient -s 0 getStorageDomainInfo e62df18b-ef25-4a1b-ae86-bee465af8087
	uuid = e62df18b-ef25-4a1b-ae86-bee465af8087
	pool = ['2d5f297c-185a-470b-8600-208fc3c9b235']
	lver = -1
	version = 3
	role = Master
	remotePath = filer01.qa.lab.tlv.redhat.com:/hateya-posix
	spm_id = -1
	type = POSIXFS
	class = Data
	master_ver = 0
	name = hateya-posix-2-bricks
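The getStorageDomainInfo output above shows the on-disk metadata is intact (role = Master), which points the KeyError at a stale in-memory copy rather than corrupt storage. As a sketch, the tab-indented "key = value" lines can be parsed like this; the sample lines are a subset copied from this report, and the parser is illustrative, not a vdsm tool:

```python
# Sketch: parse vdsClient getStorageDomainInfo output ("key = value" lines)
# into a dict. Sample lines are a subset of the output captured above.

sample = """\
\tuuid = e62df18b-ef25-4a1b-ae86-bee465af8087
\tlver = -1
\tversion = 3
\trole = Master
\tspm_id = -1
\ttype = POSIXFS
\tclass = Data
\tmaster_ver = 0
"""

def parse_domain_info(text):
    info = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip()
    return info

info = parse_domain_info(sample)
assert info["role"] == "Master"   # metadata on disk still says Master
assert info["type"] == "POSIXFS"
```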

Comment 1 Haim 2012-12-03 16:56:54 UTC
This seems more frequent on POSIX; please give it priority.

Comment 2 Daniel Paikov 2012-12-05 14:57:41 UTC
Could no longer reproduce after applying fsimonce's patch: http://gerrit.ovirt.org/#/c/9422/

Comment 3 Daniel Paikov 2012-12-05 15:27:12 UTC
Federico: If your patch does indeed fix this bug, please move this BZ to POST.

Comment 4 Federico Simoncelli 2012-12-10 09:46:41 UTC
Looking at the logs, the exception happens in connectStoragePool, which is consistent with bug 879253 and with a domain cache issue.

Based on comment 2 we could move this to CLOSED CURRENTRELEASE. Anyway, to be extra sure, I'm moving this to ON_QA to have it tested again.
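The domain cache issue mentioned above suggests the general defensive pattern of invalidating and rereading a cached domain entry before trusting it during connectStoragePool. A minimal sketch with hypothetical names (this is not the actual gerrit 9422 patch):

```python
# Sketch of the defensive pattern: refresh (drop and reread) a cached
# domain entry before trusting it on pool connect. Hypothetical names.

class DomainCache:
    def __init__(self, loader):
        self._loader = loader   # callable: sdUUID -> fresh metadata dict
        self._cache = {}

    def get(self, sdUUID):
        # Normal fast path: may return stale metadata.
        if sdUUID not in self._cache:
            self._cache[sdUUID] = self._loader(sdUUID)
        return self._cache[sdUUID]

    def refresh(self, sdUUID):
        # Invalidate and reread from storage; used when connecting a pool.
        self._cache.pop(sdUUID, None)
        return self.get(sdUUID)

storage = {"e62df18b": {"ROLE": "Master"}}
cache = DomainCache(lambda uuid: dict(storage[uuid]))

meta = cache.get("e62df18b")
meta.pop("ROLE")                 # simulate the cached copy going stale
assert "ROLE" not in cache.get("e62df18b")          # stale copy persists
assert cache.refresh("e62df18b")["ROLE"] == "Master"  # refresh fixes it
```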

Comment 6 Dafna Ron 2012-12-30 15:49:03 UTC
cannot verify on sf2 - blocked by: 
https://bugzilla.redhat.com/show_bug.cgi?id=890842

Comment 8 Elad 2013-03-17 12:30:55 UTC
Verified on SF10; reproduced 3 times.
The HSM host comes up after being activated from maintenance following the ISO domain detachment.

Comment 9 Itamar Heim 2013-06-11 09:10:37 UTC
3.2 has been released


