Bug 882907

Summary: vdsm: hsm becomes non-operational when connectStoragePool fails because it cannot read metadata (posix storage)
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.2.0
Target Release: 3.2.0
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: high
Keywords: ZStream
Whiteboard: storage
Doc Type: Bug Fix
Type: Bug
oVirt Team: Storage
Reporter: Dafna Ron <dron>
Assignee: Federico Simoncelli <fsimonce>
QA Contact: Elad <ebenahar>
CC: abaron, amureini, bazulay, cpelland, dpaikov, hateya, iheim, lpeer, scohen, ykaul
Bug Depends On: 890842
Bug Blocks: 887899
Attachments: logs

Description Dafna Ron 2012-12-03 10:27:32 UTC
Created attachment 656530 [details]
logs

Description of problem:

I attached and activated an ISO domain, then detached the ISO domain while the HSM host was in maintenance.
After activating the HSM host again, it becomes non-operational because it cannot read metadata.

Version-Release number of selected component (if applicable):

vdsm-4.9.6-44.0.el6_3.x86_64

How reproducible:

90%

Steps to Reproduce:
1. In a two-host cluster, add one POSIX data domain (I used NFS over Gluster storage)
2. Attach and activate an ISO domain
3. Put the HSM host in maintenance
4. Detach the ISO domain
5. Activate the HSM host
  
Actual results:

The HSM host becomes non-operational with an error that it cannot read metadata.

Expected results:

The host should read the metadata and become active.

Additional info: logs attached

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 862, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 904, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 648, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1524, in getMasterDomain
    if not domain.isMaster():
  File "/usr/share/vdsm/storage/sd.py", line 734, in isMaster
    return self.getMetaParam(DMDK_ROLE).capitalize() == MASTER_DOMAIN
  File "/usr/share/vdsm/storage/sd.py", line 694, in getMetaParam
    return self._metadata[key]
  File "/usr/share/vdsm/storage/persistentDict.py", line 85, in __getitem__
    return dec(self._dict[key])
  File "/usr/share/vdsm/storage/persistentDict.py", line 195, in __getitem__
    raise KeyError(key)
KeyError: 'ROLE'

Thread-2654::ERROR::2012-12-03 10:05:34,836::dispatcher::69::Storage.Dispatcher.Protect::(run) 'ROLE'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1164, in prepare
    raise self.error
KeyError: 'ROLE'
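The traceback ends in a plain dictionary lookup: `getMetaParam(DMDK_ROLE)` indexes the cached domain metadata, and if the cached copy predates the pool change (or is otherwise stale), the `ROLE` key is simply absent and `persistentDict` raises `KeyError`. A minimal sketch of that failure mode, using hypothetical stand-in classes rather than vdsm's actual `StorageDomain`/`PersistentDict` implementations:

```python
# Illustrative sketch only: mimics persistentDict's KeyError-on-missing-key
# behaviour to show why a stale metadata cache breaks isMaster().
DMDK_ROLE = "ROLE"
MASTER_DOMAIN = "Master"

class MetadataDict:
    """Stand-in for the cached, persistent metadata mapping."""
    def __init__(self, data):
        self._dict = dict(data)

    def __getitem__(self, key):
        if key not in self._dict:
            raise KeyError(key)  # matches persistentDict.py line 195
        return self._dict[key]

class Domain:
    def __init__(self, metadata):
        self._metadata = MetadataDict(metadata)

    def getMetaParam(self, key):
        return self._metadata[key]

    def isMaster(self):
        return self.getMetaParam(DMDK_ROLE).capitalize() == MASTER_DOMAIN

# A fully populated metadata cache works as expected:
good = Domain({"ROLE": "Master", "VERSION": "3"})
print(good.isMaster())

# A stale cache missing ROLE reproduces the traceback's KeyError:
stale = Domain({"VERSION": "3"})
try:
    stale.isMaster()
except KeyError as err:
    print("KeyError:", err)
```

Note that the on-disk metadata shown by `getStorageDomainInfo` below does contain `role = Master`, which is what points at a cache problem rather than corrupted storage.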


[root@camel-vdsb ~]# vdsClient -s 0 getStorageDomainsList
e62df18b-ef25-4a1b-ae86-bee465af8087

[root@camel-vdsb ~]# vdsClient -s 0 getStorageDomainInfo e62df18b-ef25-4a1b-ae86-bee465af8087
	uuid = e62df18b-ef25-4a1b-ae86-bee465af8087
	pool = ['2d5f297c-185a-470b-8600-208fc3c9b235']
	lver = -1
	version = 3
	role = Master
	remotePath = filer01.qa.lab.tlv.redhat.com:/hateya-posix
	spm_id = -1
	type = POSIXFS
	class = Data
	master_ver = 0
	name = hateya-posix-2-bricks

Comment 1 Haim 2012-12-03 16:56:54 UTC
This seems more frequent on POSIX; please give it priority.

Comment 2 Daniel Paikov 2012-12-05 14:57:41 UTC
Could no longer reproduce after applying fsimonce's patch: http://gerrit.ovirt.org/#/c/9422/

Comment 3 Daniel Paikov 2012-12-05 15:27:12 UTC
Federico: If your patch does indeed fix this bug, please move this BZ to POST.

Comment 4 Federico Simoncelli 2012-12-10 09:46:41 UTC
Looking at the logs, the exception happens in connectStoragePool, which is consistent with bug 879253 and with a domain cache issue.

Based on comment 2 we could move this to CLOSED CURRENTRELEASE. Anyway, to be extra sure, I'm moving this to ON_QA to have it tested again.
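The domain cache issue mentioned here means a stale cached domain object can be consulted during connectStoragePool instead of freshly loaded metadata. Conceptually, the remedy is to invalidate or refresh the cache entry before the master-domain check. A hedged sketch of that idea, with hypothetical names (this is not the code from the gerrit change referenced in comment 2):

```python
# Conceptual sketch of refreshing a stale cache entry before reading
# metadata. Names here are illustrative, not vdsm's real API.
class DomainCache:
    def __init__(self, loader):
        self._loader = loader  # callable: sdUUID -> metadata dict
        self._cache = {}

    def get(self, sdUUID, refresh=False):
        # Re-read from storage when asked, or when we have no entry yet.
        if refresh or sdUUID not in self._cache:
            self._cache[sdUUID] = self._loader(sdUUID)
        return self._cache[sdUUID]

def connect_pool(cache, msdUUID):
    # Drop any cached copy so the master check sees fresh metadata
    # instead of a stale entry that may be missing keys like ROLE.
    meta = cache.get(msdUUID, refresh=True)
    return meta.get("ROLE", "").capitalize() == "Master"
```

The design point is simply that a lookup path which can observe pool reconfiguration (attach/detach of domains) must not trust a cache populated before the change.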

Comment 6 Dafna Ron 2012-12-30 15:49:03 UTC
Cannot verify on SF2; blocked by:
https://bugzilla.redhat.com/show_bug.cgi?id=890842

Comment 8 Elad 2013-03-17 12:30:55 UTC
Verified on SF10; reproduced the scenario 3 times.
The HSM host comes up after being activated from maintenance following detachment of the ISO domain.

Comment 9 Itamar Heim 2013-06-11 09:10:37 UTC
3.2 has been released