Created attachment 559633 [details]
Logs

Description of problem:
My setup is the following: one Fedora 16 node and two iSCSI storage domains, each on a different storage server. A VM is running on the non-master domain. On the node I corrupted the metadata master version (MDT_MASTER_VERSION), setting it to -1, so the metadata checksum is wrong. I then triggered a master reconstruct by calling getStorageDomainInfo on the master domain. The reconstruct started, but after a few minutes the setup is in the following state:
- the single node became SPM again
- the DC is Non-responsive
- the master storage domain is Inactive
- the non-master SD is Unknown

Version-Release number of selected component (if applicable):
vdsm-4.9.3.2-0.fc16.x86_64
ovirt-engine-3.0.0_0001-1.4.fc16.x86_64

How reproducible:
30%

Steps to Reproduce:
1. Have one host and two iSCSI domains in the DC
2. Corrupt the metadata on the master domain (I changed MDT_MASTER_VERSION to -1)
3. Get info about the master storage domain (see the sketch below)

Actual results:
The DC is broken.

Expected results:
The non-master storage domain becomes the new master and the former master goes to Inactive. The host stays up, and so does the DC.

Additional info:
Backend, vdsm, and SPM logs are attached.
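To make step 3 concrete, here is a minimal sketch, assuming vdsClient is installed on the host and vdsm is reachable locally over SSL (the "-s 0" form); the storage domain UUID below is a placeholder, not taken from the logs:

# Sketch only: query the (corrupted) master storage domain so vdsm notices the
# bad metadata checksum and starts a master reconstruct, as described above.
import subprocess

MASTER_SD_UUID = "00000000-0000-0000-0000-000000000000"  # placeholder, replace with the real master SD UUID

def query_master_domain(sd_uuid):
    # getStorageDomainInfo is the verb mentioned in the description;
    # calling it on the corrupted master domain triggers the reconstruct.
    out = subprocess.check_output(
        ["vdsClient", "-s", "0", "getStorageDomainInfo", sd_uuid]).decode()
    print(out)

if __name__ == "__main__":
    query_master_domain(MASTER_SD_UUID)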
Forgot to add that the VM ends up in a Non-responding state.
Lowering severity from urgent to high, as this does not have to be fixed in a short cycle.
Note that the whole data center, including all VMs, is lost. There is no way to bring the DC up again, even after the storage metadata is corrected. No re-initialize option exists so far, so the DC must be forcibly removed.
Another interesting thing (maybe a vdsm bug?) that can be seen in the vdsm log is that the host stays in Contending status for 5 minutes after spmStart before becoming SPM of the non-responsive data center:

Thread-3126::DEBUG::2012-02-06 12:28:37,508::clientIF::261::Storage.Dispatcher.Protect::(wrapper) [10.34.63.26]
Thread-3126::DEBUG::2012-02-06 12:28:37,509::task::588::TaskManager.Task::(_updateState) Task=`673ca1b3-3b36-45bd-a7ad-468ba4128b3a`::moving from state init -> state preparing
Thread-3126::INFO::2012-02-06 12:28:37,509::logUtils::37::dispatcher::(wrapper) Run and protect: spmStart(spUUID='b8302308-ca42-48bc-94cf-7261e528014d', prevID=-1, prevLVER='-1', recoveryMode=0, scsiFencing='false', maxHostID=250, domVersion='2', options=None)
Thread-3126::DEBUG::2012-02-06 12:28:37,510::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.b8302308-ca42-48bc-94cf-7261e528014d`ReqID=`e7200cd4-b174-44e7-8746-b8deb3f1464b`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '406' at 'spmStart'
Thread-3126::DEBUG::2012-02-06 12:28:37,510::resourceManager::483::ResourceManager::(registerResource) Trying to register resource 'Storage.b8302308-ca42-48bc-94cf-7261e528014d' for lock type 'exclusive'
Thread-3126::DEBUG::2012-02-06 12:28:37,511::resourceManager::507::ResourceManager::(registerResource) Resource 'Storage.b8302308-ca42-48bc-94cf-7261e528014d' is currently locked, Entering queue (1 in queue

...

Thread-3390::DEBUG::2012-02-06 12:33:05,468::clientIF::261::Storage.Dispatcher.Protect::(wrapper) [10.34.63.26]
Thread-3390::DEBUG::2012-02-06 12:33:05,469::task::588::TaskManager.Task::(_updateState) Task=`7081e107-9f6a-4f5f-bd46-2de121b491b0`::moving from state init -> state preparing
Thread-3390::INFO::2012-02-06 12:33:05,469::logUtils::37::dispatcher::(wrapper) Run and protect: getTaskStatus(taskID='673ca1b3-3b36-45bd-a7ad-468ba4128b3a', spUUID=None, options=None)
Thread-3390::DEBUG::2012-02-06 12:33:05,470::taskManager::93::TaskManager::(getTaskStatus) Entry. taskID: 673ca1b3-3b36-45bd-a7ad-468ba4128b3a
Thread-3390::DEBUG::2012-02-06 12:33:05,470::taskManager::96::TaskManager::(getTaskStatus) Return. Response: {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': '673ca1b3-3b36-45bd-a7ad-468ba4128b3a'}
Thread-3390::INFO::2012-02-06 12:33:05,471::logUtils::39::dispatcher::(wrapper) Run and protect: getTaskStatus, Return response: {'taskStatus': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': '673ca1b3-3b36-45bd-a7ad-468ba4128b3a'}}
Thread-3390::DEBUG::2012-02-06 12:33:05,471::task::1174::TaskManager.Task::(prepare) Task=`7081e107-9f6a-4f5f-bd46-2de121b491b0`::finished: {'taskStatus': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': '673ca1b3-3b36-45bd-a7ad-468ba4128b3a'}}
Thread-3390::DEBUG::2012-02-06 12:33:05,471::task::588::TaskManager.Task::(_updateState) Task=`7081e107-9f6a-4f5f-bd46-2de121b491b0`::moving from state preparing -> state finished
Thread-3390::DEBUG::2012-02-06 12:33:05,472::resourceManager::806::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-3390::DEBUG::2012-02-06 12:33:05,472::resourceManager::841::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-3390::DEBUG::2012-02-06 12:33:05,472::task::980::TaskManager.Task::(_decref) Task=`7081e107-9f6a-4f5f-bd46-2de121b491b0`::ref 0 aborting False
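For reference, a small sketch of how that 5-minute Contending window can be observed from the host, by polling the spmStart task the same way the engine does in the excerpt above. It assumes vdsClient is available on the host; the task UUID is copied from the log and would differ on a fresh reproduction:

# Sketch only: poll getTaskStatus for the spmStart task and print how long it
# stays in the running/"Task is initializing" state before finishing.
import subprocess
import time

TASK_ID = "673ca1b3-3b36-45bd-a7ad-468ba4128b3a"  # spmStart taskID from the log above

def poll_task(task_id, interval=10, timeout=600):
    start = time.time()
    while time.time() - start < timeout:
        # getTaskStatus is the verb the engine calls repeatedly in the log excerpt
        out = subprocess.check_output(
            ["vdsClient", "-s", "0", "getTaskStatus", task_id]).decode()
        print("%.0fs: %s" % (time.time() - start, out.strip()))
        if "running" not in out:
            break
        time.sleep(interval)

if __name__ == "__main__":
    poll_task(TASK_ID)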
Created attachment 560254 [details]
Logs

Also reproducible by blocking the network connection to the master storage in step 2.
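A minimal sketch of this alternative step 2, assuming the master domain is served over plain iSCSI (TCP port 3260) and iptables is used on the host; the storage server IP is a placeholder:

# Sketch only: drop outgoing iSCSI traffic to the master storage server so the
# master domain becomes unreachable and a master reconstruct is forced.
import subprocess

MASTER_STORAGE_IP = "192.0.2.10"  # placeholder: IP of the master domain's iSCSI server

def block_master_storage(ip):
    subprocess.check_call(
        ["iptables", "-A", "OUTPUT", "-d", ip, "-p", "tcp", "--dport", "3260", "-j", "DROP"])

def unblock_master_storage(ip):
    # Remove the rule again after the test.
    subprocess.check_call(
        ["iptables", "-D", "OUTPUT", "-d", ip, "-p", "tcp", "--dport", "3260", "-j", "DROP"])

if __name__ == "__main__":
    block_master_storage(MASTER_STORAGE_IP)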
Hi, this bug was found on version 3.0, and I solved the cloned bug (800839) that was created from it for version 3.1. What should I do with this one? It has no flags set, nothing except TestBlocker.
http://gerrit.ovirt.org/#/c/4862/
Closing ON_QA bugs as oVirt 3.1 was released: http://www.ovirt.org/get-ovirt/