Bug 787673

Summary: ovirt-engine-core: Master reconstruct leads to inconsistent state
Product: [Retired] oVirt
Reporter: Jakub Libosvar <jlibosva>
Component: ovirt-engine-core
Assignee: mkublin <mkublin>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: unspecified
CC: acathrow, bazulay, iheim, oramraz, ykaul, yzaslavs
Keywords: TestBlocker
Target Milestone: ---
Target Release: 3.1
Hardware: x86_64
OS: Linux
Whiteboard: storage
Doc Type: Bug Fix
Story Points: ---
Clones: 800839 (view as bug list)
Last Closed: 2012-08-09 08:02:40 UTC
oVirt Team: Storage
Bug Blocks: 800839, 833504
Attachments: Logs (attachment 559633), Logs (attachment 560254)

Description Jakub Libosvar 2012-02-06 13:41:46 UTC
Created attachment 559633 [details]
Logs

Description of problem:
My setup is the following: one Fedora 16 node and two iSCSI storage domains, each on a different storage server. A VM is running on the non-master domain. On the node I corrupted the master version in the storage domain metadata (MDT_MASTER_VERSION) by setting it to -1, so the metadata checksum no longer matches. I then triggered a master reconstruct by calling getStorageDomainInfo on the master domain. The reconstruct started, but after a few minutes the setup is in the following state:
 - the single node became SPM again
 - the DC is Non-responsive
 - the master storage domain is Inactive
 - the non-master SD is Unknown
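
For context on why editing MDT_MASTER_VERSION by hand breaks the checksum: the storage domain metadata carries a checksum computed over its other fields, so changing one value without recomputing it leaves the metadata invalid. A minimal Python sketch of that relationship follows, loosely modelled on VDSM's persistent metadata format; the extra key names and the exact hashing scheme are illustrative assumptions, not taken from this bug.

import hashlib

# Simplified model of a storage domain metadata block: KEY=VALUE pairs
# plus a checksum computed over all of them.
metadata = {
    "MDT_MASTER_VERSION": "1",   # the value corrupted to -1 in this bug
    "MDT_VERSION": "2",          # illustrative extra keys
    "MDT_ROLE": "Master",
}

def checksum(md):
    # Assumption: SHA-1 over the sorted KEY=VALUE lines, excluding the
    # checksum itself -- a simplified stand-in for VDSM's real scheme.
    payload = "".join("%s=%s" % (k, md[k]) for k in sorted(md))
    return hashlib.sha1(payload.encode()).hexdigest()

stored_cksum = checksum(metadata)

# Corrupt the master version without recomputing the checksum, as described above.
metadata["MDT_MASTER_VERSION"] = "-1"

print(checksum(metadata) == stored_cksum)   # False -> metadata is now invalid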

Version-Release number of selected component (if applicable):
vdsm-4.9.3.2-0.fc16.x86_64
ovirt-engine-3.0.0_0001-1.4.fc16.x86_64

How reproducible:
30%

Steps to Reproduce:
1. Have one host, two iscsi domains in DC
2. Corrupt the metadata on the master domain (I changed MDT_MASTER_VERSION to -1)
3. Get info about the master storage domain (a command-line sketch of these steps follows below)
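
A rough sketch of the reproduction, assuming the vdsClient CLI is available on the host and using a placeholder domain UUID; only the getStorageDomainInfo verb is taken from the report, the rest is illustrative.

import subprocess

# Placeholder UUID of the master storage domain (hypothetical value).
MASTER_SD = "11111111-2222-3333-4444-555555555555"

# Step 2 (not shown): on the host, edit the master domain's metadata and set
# MDT_MASTER_VERSION to -1 without fixing the checksum, as described above.

# Step 3: query the master domain; in this bug this call is what triggers the
# master reconstruct. "-s 0" (SSL, local VDSM) is assumed vdsClient usage.
out = subprocess.check_output(
    ["vdsClient", "-s", "0", "getStorageDomainInfo", MASTER_SD])
print(out.decode())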
  
Actual results:
The DC is broken (Non-responsive, master storage domain Inactive, non-master domain Unknown).

Expected results:
The non-master storage domain becomes the new master and the former master goes Inactive. The host stays up, and so does the DC.

Additional info:
Backend, VDSM, and SPM logs are attached.

Comment 1 Jakub Libosvar 2012-02-06 14:15:38 UTC
Forgot to add that the VM ends up in the Non-responding state.

Comment 2 lpeer 2012-02-06 14:41:09 UTC
Lowering severity from urgent to high, as this does not have to be fixed in the short cycle.

Comment 3 Jakub Libosvar 2012-02-06 14:59:53 UTC
Note that the whole data center, including all VMs, is lost. There is no way to bring the DC up again, even after the storage metadata is corrected. No re-initialize operation exists so far, therefore the DC must be forcibly removed.

Comment 4 Jakub Libosvar 2012-02-06 15:34:39 UTC
Another interesting thing (maybe a VDSM bug?) can be seen in the VDSM log: the host stays in Contending status for 5 minutes after spmStart before becoming SPM of the non-responsive data center:


Thread-3126::DEBUG::2012-02-06 12:28:37,508::clientIF::261::Storage.Dispatcher.Protect::(wrapper) [10.34.63.26]
Thread-3126::DEBUG::2012-02-06 12:28:37,509::task::588::TaskManager.Task::(_updateState) Task=`673ca1b3-3b36-45bd-a7ad-468ba4128b3a`::moving from state init -> state preparing
Thread-3126::INFO::2012-02-06 12:28:37,509::logUtils::37::dispatcher::(wrapper) Run and protect: spmStart(spUUID='b8302308-ca42-48bc-94cf-7261e528014d', prevID=-1, prevLVER='-1', recoveryMode=0, scsiFencing='false', maxHostID=250, domVersion='2', options=None)
Thread-3126::DEBUG::2012-02-06 12:28:37,510::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.b8302308-ca42-48bc-94cf-7261e528014d`ReqID=`e7200cd4-b174-44e7-8746-b8deb3f1464b`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '406' at 'spmStart'
Thread-3126::DEBUG::2012-02-06 12:28:37,510::resourceManager::483::ResourceManager::(registerResource) Trying to register resource 'Storage.b8302308-ca42-48bc-94cf-7261e528014d' for lock type 'exclusive'
Thread-3126::DEBUG::2012-02-06 12:28:37,511::resourceManager::507::ResourceManager::(registerResource) Resource 'Storage.b8302308-ca42-48bc-94cf-7261e528014d' is currently locked, Entering queue (1 in queue

...

Thread-3390::DEBUG::2012-02-06 12:33:05,468::clientIF::261::Storage.Dispatcher.Protect::(wrapper) [10.34.63.26]
Thread-3390::DEBUG::2012-02-06 12:33:05,469::task::588::TaskManager.Task::(_updateState) Task=`7081e107-9f6a-4f5f-bd46-2de121b491b0`::moving from state init -> state preparing
Thread-3390::INFO::2012-02-06 12:33:05,469::logUtils::37::dispatcher::(wrapper) Run and protect: getTaskStatus(taskID='673ca1b3-3b36-45bd-a7ad-468ba4128b3a', spUUID=None, options=None)
Thread-3390::DEBUG::2012-02-06 12:33:05,470::taskManager::93::TaskManager::(getTaskStatus) Entry. taskID: 673ca1b3-3b36-45bd-a7ad-468ba4128b3a
Thread-3390::DEBUG::2012-02-06 12:33:05,470::taskManager::96::TaskManager::(getTaskStatus) Return. Response: {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': '673ca1b3-3b36-45bd-a7ad-468ba4128b3a'}
Thread-3390::INFO::2012-02-06 12:33:05,471::logUtils::39::dispatcher::(wrapper) Run and protect: getTaskStatus, Return response: {'taskStatus': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': '673ca1b3-3b36-45bd-a7ad-468ba4128b3a'}}
Thread-3390::DEBUG::2012-02-06 12:33:05,471::task::1174::TaskManager.Task::(prepare) Task=`7081e107-9f6a-4f5f-bd46-2de121b491b0`::finished: {'taskStatus': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': '673ca1b3-3b36-45bd-a7ad-468ba4128b3a'}}
Thread-3390::DEBUG::2012-02-06 12:33:05,471::task::588::TaskManager.Task::(_updateState) Task=`7081e107-9f6a-4f5f-bd46-2de121b491b0`::moving from state preparing -> state finished
Thread-3390::DEBUG::2012-02-06 12:33:05,472::resourceManager::806::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-3390::DEBUG::2012-02-06 12:33:05,472::resourceManager::841::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-3390::DEBUG::2012-02-06 12:33:05,472::task::980::TaskManager.Task::(_decref) Task=`7081e107-9f6a-4f5f-bd46-2de121b491b0`::ref 0 aborting False
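
The excerpt above shows the engine repeatedly asking VDSM for the status of the spmStart task and getting "Task is initializing" for roughly five minutes. A minimal polling loop in the same spirit, assuming vdsClient is available on the host; the task ID is taken from the log, while the timeout and poll interval are illustrative.

import subprocess
import time

TASK_ID = "673ca1b3-3b36-45bd-a7ad-468ba4128b3a"   # spmStart task from the log above

deadline = time.time() + 600                       # illustrative 10-minute cap
while time.time() < deadline:
    # getTaskStatus is the verb seen in the log; the vdsClient invocation is assumed.
    out = subprocess.check_output(
        ["vdsClient", "-s", "0", "getTaskStatus", TASK_ID]).decode()
    print(out.strip())
    if "running" not in out:                       # crude check: task left the running state
        break
    time.sleep(10)                                 # poll every 10 seconds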

Comment 5 Jakub Libosvar 2012-02-08 12:45:02 UTC
Created attachment 560254 [details]
Logs

Also reproducible by blocking the network connection to the master storage server in step 2.
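
A minimal sketch of that variant, cutting the host off from the master storage server before step 3; the iptables usage and the storage server address are assumptions, not taken from the attached logs. It must be run as root on the host.

import subprocess

MASTER_STORAGE_IP = "192.0.2.10"    # placeholder address of the master iSCSI server

def set_block(enabled):
    # Add (or remove) a rule dropping all outgoing traffic to the master storage server.
    action = "-A" if enabled else "-D"
    subprocess.check_call(["iptables", action, "OUTPUT",
                           "-d", MASTER_STORAGE_IP, "-j", "DROP"])

set_block(True)     # step 2: block access to the master domain
# ... run step 3 (getStorageDomainInfo on the master domain) and watch the reconstruct ...
set_block(False)    # restore connectivity afterwards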

Comment 6 mkublin 2012-06-19 13:14:55 UTC
Hi, this bug was found on version 3.0; I solved the bug cloned from it (800839) for version 3.1.
What should I do with this one? It has no flags, nothing except TestBlocker.

Comment 7 mkublin 2012-06-19 14:15:24 UTC
http://gerrit.ovirt.org/#/c/4862/

Comment 8 Itamar Heim 2012-08-09 08:02:40 UTC
Closing ON_QA bugs as oVirt 3.1 was released:
http://www.ovirt.org/get-ovirt/
