Bug 807687 - vdsm: hsm becomes non-operational after activation if changes were made to master domain or its version while host was in maintenance
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm
Version: 6.3
Hardware: Unspecified OS: Unspecified
Priority: urgent Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Eduardo Warszawski
QA Contact: Jakub Libosvar
Whiteboard: storage
Depends On:
Blocks:
Reported: 2012-03-28 09:56 EDT by Dafna Ron
Modified: 2012-12-04 13:56 EST (History)

See Also:
Fixed In Version: vdsm-4.9.6-10
Doc Type: Bug Fix
Doc Text:
Previously, due to an issue with pool metadata not refreshing correctly, attempting to put the HSM host into maintenance mode while reconstructing the master domain would result in the changes not being updated in the HSM. The pool metadata issue has been solved so that any changes and updates applied to the HSM in maintenance mode will be retained when it is reactivated.
Last Closed: 2012-12-04 13:56:56 EST

Attachments
logs (790.20 KB, application/x-gzip)
2012-03-28 09:57 EDT, Dafna Ron
Description Dafna Ron 2012-03-28 09:56:27 EDT
Description of problem:

If we put the HSM host into maintenance while the master domain is being reconstructed, the changes are not propagated to the HSM.

This can be triggered simply by putting the master domain into maintenance while the HSM is also in maintenance.

If the master domain ends up being the same domain as before (which means only its version changes), the version is not updated either, and the HSM gets the wrong master domain or version.

The backend sends disconnectStorageServer, so the domain should be disconnected.

Version-Release number of selected component (if applicable):

vdsm-4.9.6-4.5.x86_64

How reproducible:

100%

Steps to Reproduce:
1. In a two-host cluster, add two storage domains.
2. Put the HSM into maintenance, and put the master domain into maintenance as well (so that the second domain becomes master).
3. Activate the HSM -> the host becomes non-operational with a "can't find master domain" error.
4. Put the HSM into maintenance again.
5. Put the new master domain into maintenance so that the old domain becomes master again.
6. Activate the HSM -> the host becomes non-operational with a "wrong master version" error.
  
Actual results:

The HSM is not updated with changes made to the master domain while it is disconnected from the pool.
When we activate the HSM after the master has moved to a different domain, we get "can't find master domain". When only the version has changed (because the master was put into maintenance and later became master again), we get "wrong master version".

Expected results:

The HSM should pick up the changes when it is activated.

Additional info: I will attach full logs from both hosts and the backend.

Thread-350::INFO::2012-03-28 15:11:09,899::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', hostID=2, scsiKey='8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', remove=False, options=None)

Thread-351::INFO::2012-03-28 15:11:09,954::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStorageServer(domType=3, spUUID='8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', conList=[{'connection': '10.35.64.106', 'iqn': 'iqn.1986-03.com.sun:02:dafna112713222714816', 'portal': '1', 'user': '', 'password': '******', 'id': '7c2518d4-f9b6-493f-b32e-fcf1668264b3', 'port': '3260'}, {'connection': '10.35.64.10', 'iqn': 'Dafna-big', 'portal': '1', 'user': '', 'password': '******', 'id': 'b1f1a8b2-f42d-4a33-8bf5-80f39f3d04a7', 'port': '3260'}], options=None)
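The log lines in this report share a `::`-separated layout, which makes it easy to filter the attached logs for the relevant ERROR entries. The following is a minimal sketch in Python, assuming the field order that can be read off the lines quoted here (thread, level, timestamp, module, line number, logger, function, message). The regular expression and the `parse_line` helper are hypothetical and were not taken from vdsm's logging configuration.

```python
import re
from datetime import datetime

# Field layout inferred from the log lines quoted in this report (an assumption):
# thread::LEVEL::timestamp::module::lineno::logger::(function) message
LOG_LINE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<function>[^)]+)\) (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None for lines that do not match
    (for example traceback lines)."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["lineno"] = int(record["lineno"])
    # The fraction after the comma is milliseconds; %f accepts 1 to 6 digits.
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y-%m-%d %H:%M:%S,%f"
    )
    return record


def errors_only(lines):
    """Yield parsed records whose level is ERROR."""
    for line in lines:
        record = parse_line(line)
        if record is not None and record["level"] == "ERROR":
            yield record
```

Running `errors_only` over a host log would surface entries such as the `getMasterDomain` error below without the surrounding DEBUG noise.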


Thread-347::ERROR::2012-03-28 15:07:51,354::sp::1456::Storage.StoragePool::(getMasterDomain) Requested master domain 83a46a9e-dac2-4513-bb21-a33ff76a495a does not have expected version 3 it is version 1
Thread-347::DEBUG::2012-03-28 15:07:51,355::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.8ed78e50-b61e-4b84-a5b7-7c17f76f16a5'
Thread-347::DEBUG::2012-03-28 15:07:51,355::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.8ed78e50-b61e-4b84-a5b7-7c17f76f16a5' (0 active users)
Thread-347::DEBUG::2012-03-28 15:07:51,356::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.8ed78e50-b61e-4b84-a5b7-7c17f76f16a5' is free, finding out if anyone is waiting for it.
Thread-347::DEBUG::2012-03-28 15:07:51,356::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', Clearing records.
Thread-347::ERROR::2012-03-28 15:07:51,357::task::853::TaskManager.Task::(_setError) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 813, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 855, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 641, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1107, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1457, in getMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=83a46a9e-dac2-4513-bb21-a33ff76a495a, pool=8ed78e50-b61e-4b84-a5b7-7c17f76f16a5'
Thread-347::DEBUG::2012-03-28 15:07:51,358::task::872::TaskManager.Task::(_run) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::Task._run: 8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13 ('8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', 2, '8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', '83a46a9e-dac2-4513-bb21-a33ff76a495a', 3) {} failed - stopping task
Thread-347::DEBUG::2012-03-28 15:07:51,358::task::1199::TaskManager.Task::(stop) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::stopping in state preparing (force False)
Thread-347::DEBUG::2012-03-28 15:07:51,359::task::978::TaskManager.Task::(_decref) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::ref 1 aborting True
Thread-347::INFO::2012-03-28 15:07:51,359::task::1157::TaskManager.Task::(prepare) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::aborting: Task is aborted: 'Wrong Master domain or its version' - code 324
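The traceback above shows connectStoragePool failing because the master version the engine asked for (3) does not match the version the host sees for that domain (1). The kind of check involved can be sketched as follows. This is an illustration only, not the code in sp.py: `check_master`, the `domains` mapping, and the exception class here are hypothetical stand-ins, written in Python 3 for readability.

```python
class StoragePoolWrongMaster(Exception):
    """Stand-in for the se.StoragePoolWrongMaster named in the traceback."""

    def __init__(self, sp_uuid, msd_uuid):
        super().__init__(
            "Wrong Master domain or its version: 'SD=%s, pool=%s'"
            % (msd_uuid, sp_uuid)
        )


def check_master(sp_uuid, msd_uuid, expected_version, domains):
    """Validate the requested master domain against what this host sees.

    `domains` is a hypothetical mapping of domain UUID -> master version
    as currently known to the host (possibly stale).
    """
    actual_version = domains.get(msd_uuid)
    # Fail if the domain is unknown to the host or its version differs
    # from the one the engine requested.
    if actual_version is None or actual_version != expected_version:
        raise StoragePoolWrongMaster(sp_uuid, msd_uuid)
    return msd_uuid


# Values taken from the log above: the host still sees version 1,
# while the engine asks for version 3.
POOL = "8ed78e50-b61e-4b84-a5b7-7c17f76f16a5"
MASTER = "83a46a9e-dac2-4513-bb21-a33ff76a495a"
stale_view = {MASTER: 1}

try:
    check_master(POOL, MASTER, 3, stale_view)
except StoragePoolWrongMaster as exc:
    print(exc)
```

The check itself is behaving as intended in this scenario; the problem reported here is that the host's view of the version is out of date when the check runs.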
Comment 1 Dafna Ron 2012-03-28 09:57:32 EDT
Created attachment 573357 [details]
logs
Comment 2 Eduardo Warszawski 2012-05-02 09:42:38 EDT
The pool metadata is not refreshed on the HSM due to an sdc (storage domain cache) issue.

In addition:
The host reaches this situation after a lot of misleading operations by the engine, such as disconnecting the storage and then trying to connect the pool.
The second SD in the pool, 78627e5c-87f7-4492-bc41-a832c5955492, was unreachable throughout the whole log.
In spite of that, it was chosen as master, and an attempt was made to connect this HSM to this unreachable master.
The flow should be revised.

http://gerrit.ovirt.org/#change,4085
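Comment 2 attributes the failure to cached metadata not being refreshed. The general failure mode can be illustrated with a toy cache: a value read before maintenance keeps being served after the underlying storage has changed, until the cache is invalidated. This sketch makes no claim about what the linked gerrit change actually does; `MetadataCache` and its methods are hypothetical names.

```python
class MetadataCache:
    """Toy read-through cache; not vdsm's sdc implementation."""

    def __init__(self, read_from_storage):
        self._read = read_from_storage
        self._cached = {}

    def get(self, key):
        # Only goes to storage on a cache miss.
        if key not in self._cached:
            self._cached[key] = self._read(key)
        return self._cached[key]

    def invalidate(self):
        # Drop everything so the next get() re-reads from storage.
        self._cached.clear()


# Simulated shared storage holding the pool metadata.
storage = {"pool": {"master_version": 1}}
cache = MetadataCache(lambda key: dict(storage[key]))

before = cache.get("pool")["master_version"]      # host reads version 1

# While the host is in maintenance, the master version changes on storage.
storage["pool"]["master_version"] = 3

stale = cache.get("pool")["master_version"]       # still 1: stale cache

cache.invalidate()                                # refresh on reconnect
fresh = cache.get("pool")["master_version"]       # now 3

print(before, stale, fresh)
```

In this model, invalidating the cache when the host reconnects to the pool is what brings its view back in line with the version the engine requests.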
Comment 4 Jakub Libosvar 2012-05-09 12:22:49 EDT
Verified using vdsm-4.9.6-10.el6.x86_64
Comment 7 errata-xmlrpc 2012-12-04 13:56:56 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html
