+++ This bug was initially created as a clone of Bug #681546 +++

Created attachment 481876 [details]
VDSM-LOG.gz.9_refreshStoragePool_failed

VDSM [2.2.6]: An HSM host remained connected to the wrong storage pool after refreshStoragePool returned with an error, which caused two SPM hosts to run on the same cluster.

Scenario:
**********
Reconstruct master calls 'refreshStoragePool' on the HSM hosts. In this case refreshStoragePool failed and the HSM host remained connected to the wrong storage pool. When the next spmStatus ran, the pool was reported as 'Free' and spmStart succeeded, leaving the cluster with two SPM hosts.

VDSM-LOG.gz.9 (refreshStoragePool that failed on the HSM host):
**************************************************************
Thread-81469::ERROR::2011-03-01 17:56:12,727::misc::67::irs::Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 978, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/hsm.py", line 396, in public_refreshStoragePool
    return self._refreshStoragePool(pool, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/hsm.py", line 411, in _refreshStoragePool
    pool.refresh(msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1036, in refresh
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 983, in __rebuild
    msdUUID = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion).sdUUID
  File "/usr/share/vdsm/storage/sp.py", line 1223, in getMasterDomain
    self.masterDomain = self.findMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1293, in findMasterDomain
    raise e
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=e71e2fce-7fd1-41eb-a557-156cc2ffc964, pool=c01f9132-2f50-42f3-b6cf-389be95657ea'

VDSM-LOG.gz.4 - next spmStatus/spmStart succeeded:
*****************************************************
Thread-115067::DEBUG::2011-03-02 09:51:44,527::metadata::89::irs::metadata.__validate() __cksum(md) = b57d381fc4bd6f0044c85fb3fc4a923c4773c14e
Thread-115067::DEBUG::2011-03-02 09:51:44,527::spm::594::irs::(SPM.status) spUUID=c01f9132-2f50-42f3-b6cf-389be95657ea: spmStatus=Free spmLver=-1 spmId=-1
Thread-115067::DEBUG::2011-03-02 09:51:44,527::task::577::irs::Task 5821964a-d411-4dbf-889f-a5cbc879dd8e: finished: {'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': -1}}
Thread-115067::DEBUG::2011-03-02 09:51:44,528::task::577::irs::Task 5821964a-d411-4dbf-889f-a5cbc879dd8e: moving from state preparing -> state finished
Thread-115067::DEBUG::2011-03-02 09:51:44,528::resource::670::irs::Owner.releaseAll requests [] resources []
Thread-115067::DEBUG::2011-03-02 09:51:44,528::task::577::irs::Task 5821964a-d411-4dbf-889f-a5cbc879dd8e: ref 0 aborting False
Thread-115067::INFO::2011-03-02 09:51:44,528::dispatcher::101::irs::Run and protect: getSpmStatus, Return response: {'status': {'message': 'OK', 'code': 0}, 'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': -1}}
Thread-115068::INFO::2011-03-02 09:51:44,575::dispatcher::95::irs::Run and protect: spmStart, args: ( spUUID=c01f9132-2f50-42f3-b6cf-389be95657ea prevID=-1 prevLVER=-1 recoveryMode=0 scsiFencing=false maxHostID=250)
Thread-115068::DEBUG::2011-03-02 09:51:44,576::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: moving from state init -> state preparing
Thread-115068::DEBUG::2011-03-02 09:51:44,576::resource::298::irs::Resource Storage/c01f9132-2f50-42f3-b6cf-389be95657ea: f8a3883c-7c11-4232-a8ef-890a5b611fc2 acquire shared (120000)
Thread-115068::DEBUG::2011-03-02 09:51:44,576::resource::304::irs::Resource Storage/c01f9132-2f50-42f3-b6cf-389be95657ea - lockstate free
Thread-115068::DEBUG::2011-03-02 09:51:44,577::resource::239::irs::Resource 'Storage/c01f9132-2f50-42f3-b6cf-389be95657ea': __granted shared to 'f8a3883c-7c11-4232-a8ef-890a5b611fc2'
Thread-115068::DEBUG::2011-03-02 09:51:44,577::resource::524::irs::Owner f8a3883c-7c11-4232-a8ef-890a5b611fc2: _acquired Storage/c01f9132-2f50-42f3-b6cf-389be95657ea
Thread-115068::DEBUG::2011-03-02 09:51:44,577::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: _resourcesAcquired: Storage.c01f9132-2f50-42f3-b6cf-389be95657ea (shared)
Thread-115068::DEBUG::2011-03-02 09:51:44,577::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: ref 1 aborting False
Thread-115068::DEBUG::2011-03-02 09:51:44,578::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: Prepare: 1 jobs exist, move to acquiring
Thread-115068::DEBUG::2011-03-02 09:51:44,578::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: moving from state preparing -> state acquiring
Thread-115068::DEBUG::2011-03-02 09:51:44,578::task::478::irs::TaskResourcesList.allGranted: check for {'Storage!c01f9132-2f50-42f3-b6cf-389be95657ea': <storage.task.TaskResource object at 0x2aaab0199810>} owned {'Storage!c01f9132-2f50-42f3-b6cf-389be95657ea': ('Storage', 'c01f9132-2f50-42f3-b6cf-389be95657ea', 'shared')}
Thread-115068::DEBUG::2011-03-02 09:51:44,578::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: moving from state acquiring -> state queued
Thread-115068::DEBUG::2011-03-02 09:51:44,578::taskManager::38::irs::(TaskManager:queue) queueing task: f8a3883c-7c11-4232-a8ef-890a5b611fc2
Thread-115068::DEBUG::2011-03-02 09:51:44,579::taskManager::45::irs::(TaskManager:queue) task queued: f8a3883c-7c11-4232-a8ef-890a5b611fc2
Thread-115068::DEBUG::2011-03-02 09:51:44,579::task::1292::irs::(task:prepare) returning
Thread-115068::DEBUG::2011-03-02 09:51:44,579::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: ref 0 aborting False
Thread-115068::INFO::2011-03-02 09:51:44,579::dispatcher::101::irs::Run and protect: spmStart, Return response: {'status': {'message': 'OK', 'code': 0}, 'uuid': 'f8a3883c-7c11-4232-a8ef-890a5b611fc2'}
c9fa3e87-751d-4182-b6dd-8e5d4d4fcc76::DEBUG::2011-03-02 09:51:44,601::threadPool::48::irs::(threadPool:setRunningTask) Number of running tasks: 1
f8a3883c-7c11-4232-a8ef-890a5b611fc2::DEBUG::2011-03-02 09:51:44,601::threadPool::192::irs::Task: f8a3883c-7c11-4232-a8ef-890a5b611fc2 running: <bound method Task.commit of <storage.task.Task instance at 0xe843710>> with: None

--- Additional comment from ohochman on 2011-03-02 16:34:54 IST ---

Created attachment 481877 [details]
VDSM-LOG.gz.4_next_spmStatus/spmStart

Adding the spmStatus/spmStart part.

--- Additional comment from ohochman on 2011-03-02 16:59:42 IST ---

Created attachment 481879 [details]
less_tree_/rhev/

less_tree_/rhev/

--- Additional comment from ohochman on 2011-03-02 17:01:41 IST ---

[root@red-vds1 ~]# vdsClient -s 0 getSpmStatus c01f9132-2f50-42f3-b6cf-389be95657ea
        spmId = 1
        spmStatus = SPM
        spmLver = 0

[root@red-vds3 ~]# vdsClient -s 0 getSpmStatus c01f9132-2f50-42f3-b6cf-389be95657ea
        spmId = 1
        spmStatus = SPM
        spmLver = 0
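[Editorial note] To make the sequence in the scenario and logs above concrete, here is a minimal, self-contained Python sketch of the failure mode. It is not VDSM code: the class and method names are invented for illustration. It only demonstrates the state problem described in the report: a refresh that raises before any connection state is updated or torn down leaves the host on the stale (old) master, whose metadata still reports the SPM lease as Free, so a later spmStart is not refused.

# Toy model (illustrative only, not VDSM code) of the two-SPM scenario.

class StoragePoolWrongMaster(Exception):
    pass

class ToyHost:
    def __init__(self, name, master_uuid):
        self.name = name
        self.master_uuid = master_uuid   # master domain this host believes in
        self.connected = True
        self.is_spm = False

    def refresh_storage_pool(self, new_master_uuid, can_validate_master):
        # Buggy behaviour: the exception propagates before any cleanup, so
        # self.connected and self.master_uuid keep their stale values.
        if not can_validate_master:
            raise StoragePoolWrongMaster(new_master_uuid)
        self.master_uuid = new_master_uuid

    def spm_status(self):
        # The stale (old) master metadata still reports the lease as Free.
        return "Free" if self.master_uuid == "old-master" else "SPM"

    def spm_start(self):
        if self.connected and self.spm_status() == "Free":
            self.is_spm = True

spm_host = ToyHost("red-vds1", master_uuid="new-master")
spm_host.is_spm = True                   # legitimate SPM after reconstruct master

hsm_host = ToyHost("red-vds3", master_uuid="old-master")
try:
    hsm_host.refresh_storage_pool("new-master", can_validate_master=False)
except StoragePoolWrongMaster:
    pass                                 # bug: host left connected, nothing cleaned up

hsm_host.spm_start()
print(spm_host.is_spm, hsm_host.is_spm)  # True True -> two SPM hosts in one cluster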
--- Additional comment from abaron on 2011-03-03 11:06:59 IST ---

refreshStoragePool with a new master domain failed and caused the host to remain connected to the pool through the invalid (old) master domain. getSpmId afterwards succeeded, which is bad.

Correct behaviour is to (at the least):
1. immediately disconnectStoragePool
2. delete reconnectInfo

In addition, we need to make sure that:
1. a refreshStoragePool that succeeds updates the reconnect info

Last, we need to check whether we can remove spm_id and spm_lver from the metadata of non-master domains (migrate and cleanup should remove these fields; create should not put them there to begin with).

--- Additional comment from abaron on 2011-03-03 11:08:18 IST ---

Please make sure to clone this to 2.3 and fix there first.
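[Editorial note] The actual change is the gerrit review linked below. As a hedged sketch of the error handling the two numbered points above ask for (the ToyPool/ToyHsm stubs and the reconnect_info attribute are assumptions, not the real VDSM objects), it would look something like this:

# Sketch of the requested behaviour, not the actual VDSM patch.

class ToyPool:
    def __init__(self, uuid, refresh_ok=True):
        self.uuid = uuid
        self._refresh_ok = refresh_ok

    def refresh(self, msd_uuid, master_version):
        if not self._refresh_ok:
            raise RuntimeError("StoragePoolWrongMaster")

class ToyHsm:
    def __init__(self):
        self.connected_pools = set()
        self.reconnect_info = None

    def disconnect_storage_pool(self, sp_uuid):
        self.connected_pools.discard(sp_uuid)

def refresh_storage_pool(hsm, pool, msd_uuid, master_version):
    """Never leave the host half-connected when the refresh fails."""
    try:
        pool.refresh(msd_uuid, master_version)
    except Exception:
        # 1. immediately disconnect from the now-invalid pool
        hsm.disconnect_storage_pool(pool.uuid)
        # 2. delete the stored reconnect info so the host cannot silently
        #    reattach through the stale (old) master domain
        hsm.reconnect_info = None
        raise
    # a refresh that succeeds must update the reconnect info so later
    # reconnects use the new master domain and version
    hsm.reconnect_info = {"spUUID": pool.uuid,
                          "msdUUID": msd_uuid,
                          "masterVersion": master_version}

On the failure path the wrapper re-raises after cleanup, so the caller still sees the error while the host is guaranteed not to keep a stale connection or stale reconnect info.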
http://gerrit.usersys.redhat.com/403
Checked on 4.5-65.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
If refreshing the storage pool of a host storage manager (HSM) failed, the host would remain connected to the original storage pool but would still start as a storage pool manager (SPM), which meant the cluster would have two SPM hosts. Now, if refreshing the storage pool of a host fails, the host is left without a storage pool. If refreshing the storage pool succeeds, the host reconnects to the new storage pool without error.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-If refreshing the storage pool of a host storage manager (HSM) failed, the host would remain connected to the original storage pool but would still start as a storage pool manager (SPM), which meant the cluster would have two SPM hosts. Now, if refreshing the storage pool of a host fails, the host is left without a storage pool. If refreshing the storage pool succeeds, the host reconnects to the new storage pool without error.
+If the refreshStoragePool command failed on a host storage manager (HSM), the host remained connected to the original storage pool and could still start as a storage pool manager (SPM), which meant the cluster had two SPM hosts. Now, if the refreshStoragePool command fails on a host, the host is left without a storage pool, and the original SPM host remains the only SPM host.
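[Editorial note] A rough way to sanity-check the corrected behaviour from outside VDSM is to ask both hosts for their SPM status and confirm that at most one reports SPM. This is an illustrative sketch, not part of the errata or any official test: the host names come from the comments above, and pointing vdsClient at a remote host (instead of the local "0" used in this report) is an assumption; the getSpmStatus verb and its "spmStatus = SPM" output format are taken from the comments above.

# Verification sketch only; host names and remote vdsClient access are assumptions.
import subprocess

SP_UUID = "c01f9132-2f50-42f3-b6cf-389be95657ea"
HOSTS = ["red-vds1", "red-vds3"]   # hypothetical: each host's vdsm must be reachable

def spm_status(host):
    out = subprocess.check_output(
        ["vdsClient", "-s", host, "getSpmStatus", SP_UUID],
        universal_newlines=True)
    for line in out.splitlines():
        if "spmStatus" in line:
            return line.split("=", 1)[1].strip()
    return "Unknown"

statuses = {host: spm_status(host) for host in HOSTS}
print(statuses)
assert list(statuses.values()).count("SPM") <= 1, "more than one SPM host in the cluster"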
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0169.html