Bug 700000

Summary: 2.2.z: HSM host remained connected to the wrong storage pool after refreshStoragePool returned an error, causing two SPM hosts to run in the same cluster.
Product: Red Hat Enterprise Linux 5
Reporter: Dan Kenigsberg <danken>
Component: vdsm22
Assignee: Saggi Mizrahi <smizrahi>
Status: CLOSED ERRATA
QA Contact: yeylon <yeylon>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 5.6
CC: abaron, bazulay, cpelland, dpaikov, iheim, kgrainge, mkenneth, ohochman, sgrinber, smizrahi, srevivo, ykaul
Target Milestone: rc
Keywords: Reopened, ZStream
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard: storage
Fixed In Version: vdsm22-4.5-65.el5
Doc Type: Bug Fix
Doc Text:
If the refreshStoragePool command on a host storage manager failed while maintaining the host's connection to the original storage pool, the new host would start as a storage pool manager (SPM), which meant the cluster had two SPM hosts. Now, if the refreshStoragePool command fails on a host, the host is left without a storage pool, and the original SPM host remains the only SPM host.
Story Points: ---
Clone Of: 681546
Environment:
Last Closed: 2012-02-21 04:52:45 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 681546
Bug Blocks: 729320

Description Dan Kenigsberg 2011-04-27 08:47:34 UTC
+++ This bug was initially created as a clone of Bug #681546 +++

Created attachment 481876 [details]
VDSM-LOG.gz.9_refreshStoragePool_failed

VDSM [2.2.6]: HSM host remained connected to the wrong storage pool after refreshStoragePool returned an error, causing two SPM hosts to run in the same cluster.

Scenario: 
**********
Reconstruct master calls 'refreshStoragePool' on the HSM hosts. In this case refreshStoragePool failed and the HSM host remained connected to the wrong storage pool. When the next spmStatus ran, the pool reported 'Free', spmStart succeeded, and the cluster was left with two SPM hosts (a minimal sketch of this sequence is shown below).
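To make the sequence easier to follow, here is a minimal, self-contained toy model of the failure mode. This is not VDSM code; every name in it (ToyPool, CONNECTED_POOLS, refresh_storage_pool, get_spm_status) is hypothetical and only illustrates why the host can later win spmStart.

    # Toy model of the failure sequence -- hypothetical names, not VDSM code.
    class StoragePoolWrongMaster(Exception):
        """Stand-in for the VDSM exception of the same name."""

    class ToyPool:
        def __init__(self, master_uuid):
            self.master_uuid = master_uuid                 # master domain the pool reads metadata from
            self.master_metadata = {"spmStatus": "Free"}   # the stale master still reports 'Free'

        def refresh(self, new_master_uuid, new_master_version):
            # Reconstruct master points the host at a new master domain; the
            # validation fails, mirroring the StoragePoolWrongMaster traceback below.
            raise StoragePoolWrongMaster("Wrong Master domain or its version")

    CONNECTED_POOLS = {"pool-1": ToyPool("old-master")}

    def refresh_storage_pool(sp_uuid, msd_uuid, master_version):
        # Buggy behaviour: the exception propagates, but the pool stays registered
        # and keeps pointing at the old (wrong) master domain.
        CONNECTED_POOLS[sp_uuid].refresh(msd_uuid, master_version)

    def get_spm_status(sp_uuid):
        # Reads SPM state from whatever master the pool is still attached to.
        return CONNECTED_POOLS[sp_uuid].master_metadata["spmStatus"]

    try:
        refresh_storage_pool("pool-1", "new-master", 2)
    except StoragePoolWrongMaster:
        pass  # nothing disconnects the pool on failure

    # The stale master still reports 'Free', so the next spmStart on this host is
    # allowed even though another host already holds SPM -> two SPM hosts.
    print(get_spm_status("pool-1"))  # -> Free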


VDSM-LOG.gz.9 (refreshStoragePool that failed on HSM host) :
**************************************************************
Thread-81469::ERROR::2011-03-01 17:56:12,727::misc::67::irs::Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 978, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/hsm.py", line 396, in public_refreshStoragePool
    return self._refreshStoragePool(pool, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/hsm.py", line 411, in _refreshStoragePool
    pool.refresh(msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1036, in refresh
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 983, in __rebuild
    msdUUID = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion).sdUUID
  File "/usr/share/vdsm/storage/sp.py", line 1223, in getMasterDomain
    self.masterDomain = self.findMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1293, in findMasterDomain
    raise e
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=e71e2fce-7fd1-41eb-a557-156cc2ffc964, pool=c01f9132-2f50-42f3-b6cf-389be95657ea'


VDSM-LOG.gz.4 -  next spmStatus/spmStart succeeded:
*****************************************************
Thread-115067::DEBUG::2011-03-02 09:51:44,527::metadata::89::irs::metadata.__validate() __cksum(md) = b57d381fc4bd6f0044c85fb3fc4a923c4773c14e
Thread-115067::DEBUG::2011-03-02 09:51:44,527::spm::594::irs::(SPM.status) spUUID=c01f9132-2f50-42f3-b6cf-389be95657ea: spmStatus=Free spmLver=-1 spmId=-1
Thread-115067::DEBUG::2011-03-02 09:51:44,527::task::577::irs::Task 5821964a-d411-4dbf-889f-a5cbc879dd8e: finished: {'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': -1}}
Thread-115067::DEBUG::2011-03-02 09:51:44,528::task::577::irs::Task 5821964a-d411-4dbf-889f-a5cbc879dd8e: moving from state preparing -> state finished
Thread-115067::DEBUG::2011-03-02 09:51:44,528::resource::670::irs::Owner.releaseAll requests [] resources []
Thread-115067::DEBUG::2011-03-02 09:51:44,528::task::577::irs::Task 5821964a-d411-4dbf-889f-a5cbc879dd8e: ref 0 aborting False
Thread-115067::INFO::2011-03-02 09:51:44,528::dispatcher::101::irs::Run and protect: getSpmStatus, Return response: {'status': {'message': 'OK', 'code': 0}, 'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': -1}}
Thread-115068::INFO::2011-03-02 09:51:44,575::dispatcher::95::irs::Run and protect: spmStart, args: ( spUUID=c01f9132-2f50-42f3-b6cf-389be95657ea prevID=-1 prevLVER=-1 recoveryMode=0 scsiFencing=false maxHostID=250)
Thread-115068::DEBUG::2011-03-02 09:51:44,576::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: moving from state init -> state preparing
Thread-115068::DEBUG::2011-03-02 09:51:44,576::resource::298::irs::Resource Storage/c01f9132-2f50-42f3-b6cf-389be95657ea: f8a3883c-7c11-4232-a8ef-890a5b611fc2 acquire shared (120000)
Thread-115068::DEBUG::2011-03-02 09:51:44,576::resource::304::irs::Resource Storage/c01f9132-2f50-42f3-b6cf-389be95657ea - lockstate free
Thread-115068::DEBUG::2011-03-02 09:51:44,577::resource::239::irs::Resource 'Storage/c01f9132-2f50-42f3-b6cf-389be95657ea': __granted shared to 'f8a3883c-7c11-4232-a8ef-890a5b611fc2'
Thread-115068::DEBUG::2011-03-02 09:51:44,577::resource::524::irs::Owner f8a3883c-7c11-4232-a8ef-890a5b611fc2: _acquired Storage/c01f9132-2f50-42f3-b6cf-389be95657ea
Thread-115068::DEBUG::2011-03-02 09:51:44,577::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: _resourcesAcquired: Storage.c01f9132-2f50-42f3-b6cf-389be95657ea (shared)
Thread-115068::DEBUG::2011-03-02 09:51:44,577::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: ref 1 aborting False
Thread-115068::DEBUG::2011-03-02 09:51:44,578::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: Prepare: 1 jobs exist, move to acquiring
Thread-115068::DEBUG::2011-03-02 09:51:44,578::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: moving from state preparing -> state acquiring
Thread-115068::DEBUG::2011-03-02 09:51:44,578::task::478::irs::TaskResourcesList.allGranted: check for {'Storage!c01f9132-2f50-42f3-b6cf-389be95657ea': <storage.task.TaskResource object at 0x2aaab0199810>} owned {'Storage!c01f9132-2f50-42f3-b6cf-389be95657ea': ('Storage', 'c01f9132-2f50-42f3-b6cf-389be95657ea', 'shared')}
Thread-115068::DEBUG::2011-03-02 09:51:44,578::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: moving from state acquiring -> state queued
Thread-115068::DEBUG::2011-03-02 09:51:44,578::taskManager::38::irs::(TaskManager:queue) queueing task: f8a3883c-7c11-4232-a8ef-890a5b611fc2
Thread-115068::DEBUG::2011-03-02 09:51:44,579::taskManager::45::irs::(TaskManager:queue) task queued: f8a3883c-7c11-4232-a8ef-890a5b611fc2
Thread-115068::DEBUG::2011-03-02 09:51:44,579::task::1292::irs::(task:prepare) returning
Thread-115068::DEBUG::2011-03-02 09:51:44,579::task::577::irs::Task f8a3883c-7c11-4232-a8ef-890a5b611fc2: ref 0 aborting False
Thread-115068::INFO::2011-03-02 09:51:44,579::dispatcher::101::irs::Run and protect: spmStart, Return response: {'status': {'message': 'OK', 'code': 0}, 'uuid': 'f8a3883c-7c11-4232-a8ef-890a5b611fc2'}
c9fa3e87-751d-4182-b6dd-8e5d4d4fcc76::DEBUG::2011-03-02 09:51:44,601::threadPool::48::irs::(threadPool:setRunningTask) Number of running tasks: 1
f8a3883c-7c11-4232-a8ef-890a5b611fc2::DEBUG::2011-03-02 09:51:44,601::threadPool::192::irs::Task: f8a3883c-7c11-4232-a8ef-890a5b611fc2 running: <bound method Task.commit of <storage.task.Task instance at 0xe843710>> with: None

--- Additional comment from ohochman on 2011-03-02 16:34:54 IST ---

Created attachment 481877 [details]
VDSM-LOG.gz.4_next_spmStatus/spmStart

adding spmStatus/spmStart part.

--- Additional comment from ohochman on 2011-03-02 16:59:42 IST ---

Created attachment 481879 [details]
less_tree_/rhev/

less_tree_/rhev/

--- Additional comment from ohochman on 2011-03-02 17:01:41 IST ---

[root@red-vds1 ~]# vdsClient -s 0 getSpmStatus c01f9132-2f50-42f3-b6cf-389be95657ea
        spmId = 1
        spmStatus = SPM
        spmLver = 0

[root@red-vds3 ~]# vdsClient -s 0 getSpmStatus c01f9132-2f50-42f3-b6cf-389be95657ea
        spmId = 1
        spmStatus = SPM
        spmLver = 0

--- Additional comment from abaron on 2011-03-03 11:06:59 IST ---

refreshStoragePool with the new master domain failed and caused the host to remain connected to the pool through the invalid (old) master domain. getSpmId afterwards succeeded, which is bad.

Correct behaviour is, at the least, to (see the sketch at the end of this comment):
1. immediately disconnectStoragePool
2. delete reconnectInfo


In addition need to make sure that:
1. refreshStoragePool that succeeds updates reconnect info

Last, we need to check whether we can remove spm_id and spm_lver from the metadata of non-master domains (migrate and cleanup should remove these fields; create should not put them there to begin with).
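The recommendation above maps to a small error-handling pattern. The following is only a hedged sketch of that pattern, not the actual VDSM patch (the real change is in the gerrit link in comment 1); RECONNECT_INFO, CONNECTED_POOLS and disconnect_storage_pool are hypothetical stand-ins for VDSM internals.

    # Hedged sketch of the suggested handling -- hypothetical helpers, not the real fix.
    RECONNECT_INFO = {}    # spUUID -> (msdUUID, masterVersion) used on reconnect
    CONNECTED_POOLS = {}   # spUUID -> connected pool object

    def disconnect_storage_pool(sp_uuid):
        # Drop the host's attachment to the pool entirely.
        CONNECTED_POOLS.pop(sp_uuid, None)

    def refresh_storage_pool(pool, sp_uuid, msd_uuid, master_version):
        try:
            pool.refresh(msd_uuid, master_version)
        except Exception:
            # 1. Immediately disconnect, so the host is not left attached to the
            #    wrong (old) master domain.
            disconnect_storage_pool(sp_uuid)
            # 2. Delete the reconnect info, so the host cannot silently reattach
            #    to the stale pool later.
            RECONNECT_INFO.pop(sp_uuid, None)
            raise
        # A refresh that succeeds must also update the reconnect info.
        RECONNECT_INFO[sp_uuid] = (msd_uuid, master_version)

With handling of this shape, the failing host from the logs would end up with no pool at all instead of answering spmStatus as 'Free' against the stale master.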

--- Additional comment from abaron on 2011-03-03 11:08:18 IST ---

Please make sure to clone this to 2.3 and fix there first.

Comment 1 Saggi Mizrahi 2011-05-16 13:11:50 UTC
http://gerrit.usersys.redhat.com/403

Comment 4 Daniel Paikov 2011-07-24 14:33:49 UTC
Checked on 4.5-65.

Comment 6 Kate Grainger 2011-08-24 04:24:39 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
If the refreshing the storage pool of a host storage manager failed, the host would remain connected to the original storage pool but the new storage pool would start as a storage pool manager (SPM), which meant the cluster would have two SPM hosts. Now, if refreshing the storage pool of a host fails, the host is left without a storage pool. If refreshing the storage pool succeeds, the host reconnects to the new storage pool without error.

Comment 7 Kate Grainger 2011-08-29 00:41:05 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-If the refreshing the storage pool of a host storage manager failed, the host would remain connected to the original storage pool but the new storage pool would start as a storage pool manager (SPM), which meant the cluster would have two SPM hosts. Now, if refreshing the storage pool of a host fails, the host is left without a storage pool. If refreshing the storage pool succeeds, the host reconnects to the new storage pool without error.
+If the refreshStoragePool command on a host storage manager failed while maintaining the host's connection to the original storage pool, the new host would start as a storage pool manager (SPM), which meant the cluster had two SPM hosts. Now, if the refreshStoragePool command fails on a host, the host is left without a storage pool, and the original SPM host remains the only SPM host.

Comment 8 errata-xmlrpc 2012-02-21 04:52:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0169.html