Bug 807687

Summary: vdsm: hsm becomes non-operational after activation if changes were made to master domain or its version while host was in maintenance
Product: Red Hat Enterprise Linux 6
Reporter: Dafna Ron <dron>
Component: vdsm
Assignee: Eduardo Warszawski <ewarszaw>
Status: CLOSED ERRATA
QA Contact: Jakub Libosvar <jlibosva>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.3
CC: abaron, aburden, bazulay, cpelland, danken, hateya, iheim, ilvovsky, jbiddle, jlibosva, ykaul
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version: vdsm-4.9.6-10
Doc Type: Bug Fix
Doc Text:
Previously, because pool metadata was not refreshed correctly, changes made to the master domain or its version while an HSM host was in maintenance mode (for example, during a master domain reconstruction) were not picked up by that host, and it became non-operational on activation. The pool metadata issue has been fixed, so an HSM host now picks up the current master domain and version when it is reactivated.
Last Closed: 2012-12-04 18:56:56 UTC
Attachments:
logs (flags: none)

Description Dafna Ron 2012-03-28 13:56:27 UTC
Description of problem:

If the HSM host is put into maintenance while the master domain is being reconstructed, the changes are not propagated to the HSM.

This can be triggered simply by putting the master domain into maintenance while the HSM is also in maintenance.

If the master role returns to the same domain as before (which means only the master version changes), the version is not updated either, and the HSM ends up with the wrong master domain or the wrong version.

The backend sends disconnectStorageServer, so the domain should be disconnected.

Version-Release number of selected component (if applicable):

vdsm-4.9.6-4.5.x86_64

How reproducible:

100%

Steps to Reproduce:
1. In a two-host cluster, add two storage domains.
2. Put the HSM in maintenance, then put the master domain in maintenance as well (so that the second domain becomes master).
3. Activate the HSM -> the host becomes non-operational with a "can't find master domain" error.
4. Put the HSM in maintenance again.
5. Put the new master domain in maintenance so that the old domain becomes master again.
6. Activate the HSM -> the host becomes non-operational with a "wrong master version" error.
  
Actual results:

The HSM is not updated with changes made to the master domain while it is disconnected from the pool.
When the HSM is activated after the master has moved to a different domain, it fails with "can't find master". When only the master version has changed (because the master was put in maintenance and later became master again), it fails with "wrong version".
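
The two failure modes above can be illustrated with a small, self-contained sketch. This is not vdsm code: `check_master`, `MasterCheckError`, and the `sd-A`/`sd-B` identifiers are hypothetical names chosen for illustration, and the dictionary stands in for whatever view of the pool the host holds when it is activated.

```python
class MasterCheckError(Exception):
    """Hypothetical stand-in for the errors the host reports on activation."""


def check_master(visible_domains, msd_uuid, expected_version):
    """Validate the master the engine asks for against the host's own view.

    visible_domains maps a domain id to the master version the host reads
    for it. The host's view may be stale if it was in maintenance.
    """
    if msd_uuid not in visible_domains:
        # Failure mode 1: the master moved to a domain the host does not see.
        raise MasterCheckError("cannot find master domain %s" % msd_uuid)
    actual = visible_domains[msd_uuid]
    if actual != expected_version:
        # Failure mode 2: same domain, but the master version was bumped.
        raise MasterCheckError(
            "master domain %s does not have expected version %d, it is version %d"
            % (msd_uuid, expected_version, actual)
        )
    return msd_uuid


if __name__ == "__main__":
    stale_view = {"sd-A": 1}  # what the host remembered before maintenance
    for msd, version in (("sd-B", 2), ("sd-A", 3)):
        try:
            check_master(stale_view, msd, version)
        except MasterCheckError as err:
            print(err)
```

The first call mirrors step 3 (master moved to another domain), the second mirrors step 6 (master returned to the original domain with a higher version).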

Expected results:

The HSM should be updated with the changes when it is activated.

Additional info: Full logs from both hosts and the backend will be attached.

Thread-350::INFO::2012-03-28 15:11:09,899::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', hostID=2, scsiKey='8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', remove=False, options=None)

Thread-351::INFO::2012-03-28 15:11:09,954::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStorageServer(domType=3, spUUID='8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', conList=[{'connection': '10.35.64.106', 'iqn': 'iqn.1986-03.com.sun:02:dafna112713222714816', 'portal': '1', 'user': '', 'password': '******', 'id': '7c2518d4-f9b6-493f-b32e-fcf1668264b3', 'port': '3260'}, {'connection': '10.35.64.10', 'iqn': 'Dafna-big', 'portal': '1', 'user': '', 'password': '******', 'id': 'b1f1a8b2-f42d-4a33-8bf5-80f39f3d04a7', 'port': '3260'}], options=None)


Thread-347::ERROR::2012-03-28 15:07:51,354::sp::1456::Storage.StoragePool::(getMasterDomain) Requested master domain 83a46a9e-dac2-4513-bb21-a33ff76a495a does not have expected version 3 it is version 1
Thread-347::DEBUG::2012-03-28 15:07:51,355::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.8ed78e50-b61e-4b84-a5b7-7c17f76f16a5'
Thread-347::DEBUG::2012-03-28 15:07:51,355::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.8ed78e50-b61e-4b84-a5b7-7c17f76f16a5' (0 active users)
Thread-347::DEBUG::2012-03-28 15:07:51,356::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.8ed78e50-b61e-4b84-a5b7-7c17f76f16a5' is free, finding out if anyone is waiting for it.
Thread-347::DEBUG::2012-03-28 15:07:51,356::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', Clearing records.
Thread-347::ERROR::2012-03-28 15:07:51,357::task::853::TaskManager.Task::(_setError) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 813, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 855, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 641, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1107, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1457, in getMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=83a46a9e-dac2-4513-bb21-a33ff76a495a, pool=8ed78e50-b61e-4b84-a5b7-7c17f76f16a5'
Thread-347::DEBUG::2012-03-28 15:07:51,358::task::872::TaskManager.Task::(_run) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::Task._run: 8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13 ('8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', 2, '8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', '83a46a9e-dac2-4513-bb21-a33ff76a495a', 3) {} failed - stopping task
Thread-347::DEBUG::2012-03-28 15:07:51,358::task::1199::TaskManager.Task::(stop) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::stopping in state preparing (force False)
Thread-347::DEBUG::2012-03-28 15:07:51,359::task::978::TaskManager.Task::(_decref) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::ref 1 aborting True
Thread-347::INFO::2012-03-28 15:07:51,359::task::1157::TaskManager.Task::(prepare) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::aborting: Task is aborted: 'Wrong Master domain or its version' - code 324
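
For anyone sifting through the attached logs, the excerpt above follows a `::`-separated layout (thread, level, timestamp, module, line number, logger, then `(function) message`). The sketch below parses that layout as it appears in this excerpt only. The regular expression and the `parse_vdsm_line` helper are assumptions for illustration, not a vdsm utility, and other vdsm versions may log differently.

```python
import re
from datetime import datetime

# Field layout inferred from the excerpt in this report, not from vdsm sources.
LINE_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>\w+)::(?P<lineno>\d+)::(?P<logger>[\w.]+)::"
    r"\((?P<func>\w+)\) (?P<msg>.*)$"
)


def parse_vdsm_line(line):
    """Return a dict of fields, or None for lines that do not match
    (for example traceback continuation lines)."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    fields = match.groupdict()
    fields["lineno"] = int(fields["lineno"])
    # "%f" accepts the three-digit millisecond part after the comma.
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return fields


if __name__ == "__main__":
    sample = (
        "Thread-347::ERROR::2012-03-28 15:07:51,354::sp::1456::"
        "Storage.StoragePool::(getMasterDomain) Requested master domain "
        "83a46a9e-dac2-4513-bb21-a33ff76a495a does not have expected "
        "version 3 it is version 1"
    )
    record = parse_vdsm_line(sample)
    print(record["level"], record["func"], record["msg"])
```

Filtering the parsed records on `level == "ERROR"` is a quick way to isolate entries such as the `getMasterDomain` and `_setError` lines from the surrounding DEBUG output.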

Comment 1 Dafna Ron 2012-03-28 13:57:32 UTC
Created attachment 573357
logs

Comment 2 Eduardo Warszawski 2012-05-02 13:42:38 UTC
The pool metadata is not refreshed because of an sdc issue (on the HSM).

In addition:
The host reached this state after a series of misleading operations by the engine, such as disconnecting the storage and then trying to connect the pool.
The second SD in the pool, 78627e5c-87f7-4492-bc41-a832c5955492, was unreachable throughout the whole log.
Despite that, it was chosen as master, and an attempt was made to connect this HSM to that unreachable master.
The flow should be revised.

http://gerrit.ovirt.org/#change,4085
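
To make the first point of this comment concrete, here is a minimal sketch of how a cache that is never invalidated keeps returning old metadata. It assumes that "sdc" refers to a storage domain cache on the host. `DomainCache`, `get_version`, and `refresh` are hypothetical names, and the sketch does not reproduce the actual change in the linked gerrit patch.

```python
class DomainCache:
    """Toy cache of per-domain master versions (illustrative only)."""

    def __init__(self, read_metadata):
        # read_metadata: callable taking a domain id and returning the
        # master version currently written on storage.
        self._read_metadata = read_metadata
        self._cached = {}

    def get_version(self, domain_id):
        # Only the first lookup touches storage; later calls reuse the cache.
        if domain_id not in self._cached:
            self._cached[domain_id] = self._read_metadata(domain_id)
        return self._cached[domain_id]

    def refresh(self):
        # Drop cached entries so the next lookup re-reads storage.
        self._cached.clear()


if __name__ == "__main__":
    on_storage = {"sd-A": 1}
    cache = DomainCache(on_storage.__getitem__)

    print(cache.get_version("sd-A"))  # read from storage and cached
    on_storage["sd-A"] = 3            # master version bumped elsewhere
    print(cache.get_version("sd-A"))  # still the cached value
    cache.refresh()
    print(cache.get_version("sd-A"))  # re-read after refresh
```

In this toy model, a host that reconnects to the pool without calling `refresh()` compares the engine's expected version against the cached one, which matches the "expected version 3 it is version 1" error in the description.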

Comment 4 Jakub Libosvar 2012-05-09 16:22:49 UTC
Verified using vdsm-4.9.6-10.el6.x86_64

Comment 7 errata-xmlrpc 2012-12-04 18:56:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html