Previously, due to an issue with pool metadata not refreshing correctly, attempting to put the HSM host into maintenance mode while reconstructing the master domain would result in the changes not being updated in the HSM. The pool metadata issue has been solved so that any changes and updates applied to the HSM in maintenance mode will be retained when it is reactivated.
Description of problem:
if we put the hsm host in maintenance while we reconstruct master then the changes are not updated in hsm
this is caused by simply putting the master domain in maintenance while hsm is also in maintenance
if master domain is the same domain as before (which means version changes) the version is also not updated and the hsm will get wrong master domain or version.
backend is sending disconnectStorageServer so domain should be disconnected
Version-Release number of selected component (if applicable):
vdsm-4.9.6-4.5.x86_64
How reproducible:
100%
Steps to Reproduce:
1. in two hosts cluster add two storage domains
2. put hsm in maintenance and put the master domain in maintenance as well (so that the second domain will become master)
3. activate the hsm -> host will become non-operational with can't find master domain error
4. put the hsm in maintenance again
5. put the new master domain in maintenance so that the old domain will become master again
6. activate the hsm -> host will become non-operational with wrong master version error
Actual results:
hsm is not updated with changes made to master domain while its disconnected from pool.
when we activate the hsm and the master has changed to different location we get can't find master and when the version has changed (if we put master in maintenance) we will get wrong version
Expected results:
hsm should be updated with changes when activated.
Additional info: will attach full logs from both hosts and backend
Thread-350::INFO::2012-03-28 15:11:09,899::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', hostID=2, scsiKey='8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', remove=False, options=None)
Thread-351::INFO::2012-03-28 15:11:09,954::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStorageServer(domType=3, spUUID='8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', conList=[{'connection': '10.35.64.106', 'iqn': 'iqn.1986-03.com.sun:02:dafna112713222714816', 'portal': '1', 'user': '', 'password': '******', 'id': '7c2518d4-f9b6-493f-b32e-fcf1668264b3', 'port': '3260'}, {'connection': '10.35.64.10', 'iqn': 'Dafna-big', 'portal': '1', 'user': '', 'password': '******', 'id': 'b1f1a8b2-f42d-4a33-8bf5-80f39f3d04a7', 'port': '3260'}], options=None)
Thread-347::ERROR::2012-03-28 15:07:51,354::sp::1456::Storage.StoragePool::(getMasterDomain) Requested master domain 83a46a9e-dac2-4513-bb21-a33ff76a495a does not have expected
version 3 it is version 1
Thread-347::DEBUG::2012-03-28 15:07:51,355::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.8ed78e50-b61e-4b84-a5b7-7c17f76f16a5'
Thread-347::DEBUG::2012-03-28 15:07:51,355::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.8ed78e50-b61e-4b84-a5b7-7c17f76f16a5' (0 active u
sers)
Thread-347::DEBUG::2012-03-28 15:07:51,356::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.8ed78e50-b61e-4b84-a5b7-7c17f76f16a5' is free, finding out
if anyone is waiting for it.
Thread-347::DEBUG::2012-03-28 15:07:51,356::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.8ed78e50-b61e-4b84-a5b7-7c17f76f16a5
', Clearing records.
Thread-347::ERROR::2012-03-28 15:07:51,357::task::853::TaskManager.Task::(_setError) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/task.py", line 861, in _run
return fn(*args, **kargs)
File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
res = f(*args, **kwargs)
File "/usr/share/vdsm/storage/hsm.py", line 813, in connectStoragePool
return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
File "/usr/share/vdsm/storage/hsm.py", line 855, in _connectStoragePool
res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
File "/usr/share/vdsm/storage/sp.py", line 641, in connect
self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
File "/usr/share/vdsm/storage/sp.py", line 1107, in __rebuild
self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
File "/usr/share/vdsm/storage/sp.py", line 1457, in getMasterDomain
raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=83a46a9e-dac2-4513-bb21-a33ff76a495a, pool=8ed78e50-b61e-4b84-a5b7-7c17f76f16a5'
Thread-347::DEBUG::2012-03-28 15:07:51,358::task::872::TaskManager.Task::(_run) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::Task._run: 8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13 ('8e
d78e50-b61e-4b84-a5b7-7c17f76f16a5', 2, '8ed78e50-b61e-4b84-a5b7-7c17f76f16a5', '83a46a9e-dac2-4513-bb21-a33ff76a495a', 3) {} failed - stopping task
Thread-347::DEBUG::2012-03-28 15:07:51,358::task::1199::TaskManager.Task::(stop) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::stopping in state preparing (force False)
Thread-347::DEBUG::2012-03-28 15:07:51,359::task::978::TaskManager.Task::(_decref) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::ref 1 aborting True
Thread-347::INFO::2012-03-28 15:07:51,359::task::1157::TaskManager.Task::(prepare) Task=`8c7b0f3f-4a37-403b-8d67-4ab0f0b44b13`::aborting: Task is aborted: 'Wrong Master domain o
r its version' - code 324
Comment 2Eduardo Warszawski
2012-05-02 13:42:38 UTC
The pool metadata is not refreshed due to an sdc issue. (HSM)
In addition:
The host reach this situation after a lot of misleading operations of the engine, like disconnect the storage and try to connect the pool.
The 2nd SD in the pool, 78627e5c-87f7-4492-bc41-a832c5955492 was unreacheable over the whole log.
In spite of that was choosed as master and was attempted to connect this HSM to a this unreacheable master.
The flow should be revised.
http://gerrit.ovirt.org/#change,4085
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2012-1508.html