Bug 958044 - On storage problem host is moved to nonoperational however one domain is still accessible
Summary: On storage problem host is moved to nonoperational however one domain is stil...
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: All
OS: Linux
unspecified
high
Target Milestone: ---
: 3.3.0
Assignee: Maor
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-04-30 08:45 UTC by Jakub Libosvar
Modified: 2016-02-10 20:40 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-07-11 15:18:51 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Backend, vdsm logs (1.15 MB, application/x-bzip)
2013-04-30 08:45 UTC, Jakub Libosvar
no flags Details
engine log (8.16 MB, text/x-log)
2013-07-11 15:17 UTC, Maor
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 987874 0 unspecified CLOSED Wrong MDT_MASTER_VERSION value doesn't lead to reconstruct on second domain 2021-02-22 00:41:40 UTC

Internal Links: 987874

Description Jakub Libosvar 2013-04-30 08:45:41 UTC
Created attachment 741798 [details]
Backend, vdsm logs

Description of problem:
I have one host in setup and three storage domains. Master and one non-master data domain are on storage server A. Third data-domain is on storage server B. I have one running VM that has one disk on master domain.

I blocked network connection from SPM to storage server A. All storage domains went to Unknown status and after a while SPM went to NonOperational, therefore following ReconstructMaster failed. It should have succeeded on the third domain that was still accessible from SPM.

Version-Release number of selected component (if applicable):
rhevm-3.2.0-10.20.master.el6ev.noarch.rpm
vdsm-4.10.2.0.19.gitf9eda0a

How reproducible:
50%

Steps to Reproduce:
1. Have one host, two storage domains from one storage server and one storage domain from different storage server.
2. Block network communication between SPM and master domain that is on first storage server
3. Wait for reconstruct
  
Actual results:
Host is moved to NonOperational however one storage domain is accessible.

Expected results:
Reconstruct happens on third domain that is accessible, inaccessible domains are disconnected and have inactive status. Host is up and DC is up.

Additional info:
Note that there is some vdsm probably built from source, not sure where does it come from, it's on our QA Jenkins.
Connection is blocked at 2013-04-29 18:53:49,430 using iptables
not sure about regression, this is new test.

Comment 1 Maor 2013-07-09 12:55:56 UTC
Reproduce is needed here, this could have already been solved as part of reconstruct changes

Comment 2 Maor 2013-07-11 15:17:37 UTC
Created attachment 772277 [details]
engine log

Comment 3 Maor 2013-07-11 15:18:51 UTC
reproduced the same scenario.
reconstruct has been succeeded.


2013-07-11 18:07:16,734 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-69) START, ConnectStoragePoolVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, vds_spm_id = 1, masterDomainId = a91edf4c-fd49-435d-a7dc-2210393f413e, masterVersion = 1), log id: 146037ab
2013-07-11 18:08:21,211 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-69) Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=304, mMessage=Cannot find master domain: 'spUUID=83828eac-d311-434d-90e0-ae671a8db853, msdUUID=a91edf4c-fd49-435d-a7dc-2210393f413e']]
2013-07-11 18:08:21,212 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-69) HostName = pluto-vdsb.eng.lab.tlv.redhat.com
2013-07-11 18:08:21,213 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-69) Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=83828eac-d311-434d-90e0-ae671a8db853, msdUUID=a91edf4c-fd49-435d-a7dc-2210393f413e'
2013-07-11 18:08:21,215 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-69) FINISH, ConnectStoragePoolVDSCommand, log id: 146037ab
2013-07-11 18:08:21,216 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-69) IrsBroker::Failed::GetStoragePoolInfoVDS
2013-07-11 18:08:21,217 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-69) Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=83828eac-d311-434d-90e0-ae671a8db853, msdUUID=a91edf4c-fd49-435d-a7dc-2210393f413e'
2013-07-11 18:08:21,296 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-6-thread-49) Running command: ReconstructMasterDomainCommand internal: true. Entities affected :  ID: a91edf4c-fd49-435d-a7dc-2210393f413e Type: Storage
2013-07-11 18:08:21,347 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (pool-6-thread-49) START, ResetIrsVDSCommand( storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, ignoreFailoverLimit = false, compatabilityVersion = null, vdsId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, ignoreStopFailed = true), log id: 7820169c
2013-07-11 18:08:21,351 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-6-thread-49) START, SpmStopVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853), log id: 31cd664e
2013-07-11 18:08:21,359 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (pool-6-thread-49) Command org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand return value

TaskStatusListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=654, mMessage=Not SPM: ()]]

2013-07-11 18:08:21,362 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (pool-6-thread-49) HostName = pluto-vdsb.eng.lab.tlv.redhat.com
2013-07-11 18:08:21,363 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (pool-6-thread-49) Command HSMGetAllTasksStatusesVDS execution failed. Exception: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM: ()
2013-07-11 18:08:21,364 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-6-thread-49) SpmStopVDSCommand::Stopping SPM on vds pluto-vdsb.eng.lab.tlv.redhat.com, pool id 83828eac-d311-434d-90e0-ae671a8db853
2013-07-11 18:08:21,375 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-6-thread-49) FINISH, SpmStopVDSCommand, log id: 31cd664e
2013-07-11 18:08:21,389 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (pool-6-thread-49) FINISH, ResetIrsVDSCommand, log id: 7820169c
2013-07-11 18:08:21,392 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (pool-6-thread-49) START, DisconnectStoragePoolVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, vds_spm_id = 1), log id: c8c6e31
2013-07-11 18:08:21,398 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (pool-6-thread-49) FINISH, DisconnectStoragePoolVDSCommand, log id: c8c6e31
2013-07-11 18:08:21,404 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-6-thread-49) START, ReconstructMasterVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, vdsSpmId = 1, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, storagePoolName = DC3, masterDomainId = 8d7eb0de-fa02-4294-a4c8-f685ac4b5a57, masterVersion = 2, domainsList = [{ domainId: 8d7eb0de-fa02-4294-a4c8-f685ac4b5a57, status: Unknown };{ domainId: 2dce38f3-3a49-497c-90f0-7cc950f28f56, status: Unknown };{ domainId: a91edf4c-fd49-435d-a7dc-2210393f413e, status: InActive };]), log id: efad4b4
2013-07-11 18:08:31,265 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-46) hostFromVds::selectedVds - pluto-vdsb.eng.lab.tlv.redhat.com, spmStatus Unknown_Pool, storage pool DC3
2013-07-11 18:08:31,272 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-46) START, ConnectStoragePoolVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, vds_spm_id = 1, masterDomainId = 8d7eb0de-fa02-4294-a4c8-f685ac4b5a57, masterVersion = 2), log id: 1a5e353d
2013-07-11 18:09:47,751 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-6-thread-49) FINISH, ReconstructMasterVDSCommand, log id: efad4b4
2013-07-11 18:09:47,760 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RefreshStoragePoolVDSCommand] (pool-6-thread-50) START, RefreshStoragePoolVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, masterStorageDomainId=8d7eb0de-fa02-4294-a4c8-f685ac4b5a57, masterVersion=2), log id: 48cf6f11
2013-07-11 18:10:00,010 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-8) Autorecovering 1 storage domains
2013-07-11 18:10:00,012 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-8) Autorecovering storage domains id: a91edf4c-fd49-435d-a7dc-2210393f413e
2013-07-11 18:10:00,019 INFO  [org.ovirt.engine.core.bll.storage.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-8) Running command: ConnectDomainToStorageCommand internal: true. Entities affected :  ID: a91edf4c-fd49-435d-a7dc-2210393f413e Type: Storage
2013-07-11 18:10:00,020 INFO  [org.ovirt.engine.core.bll.storage.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-8) ConnectDomainToStorage. Before Connect all hosts to pool. Time:7/11/13 6:10 PM
2013-07-11 18:10:00,035 INFO  [org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand] (pool-6-thread-45) Running command: ConnectStorageToVdsCommand internal: true. Entities affected :  ID: aaa00000-0000-0000-0000-123456789aaa Type: System
2013-07-11 18:10:00,045 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-6-thread-45) START, ConnectStorageServerVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 51688bbd-1b43-45cc-910f-475770d7c563, connection: multipass.eng.lab.tlv.redhat.com:/export/images/rnd/maor/data, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 3a7985d6
2013-07-11 18:10:51,265 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-46) FINISH, ConnectStoragePoolVDSCommand, log id: 1a5e353d
2013-07-11 18:11:51,378 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-6-thread-45) FINISH, ConnectStorageServerVDSCommand, return: {51688bbd-1b43-45cc-910f-475770d7c563=0}, log id: 3a7985d6
2013-07-11 18:11:51,380 INFO  [org.ovirt.engine.core.bll.storage.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-8) ConnectDomainToStorage. After Connect all hosts to pool. Time:7/11/13 6:11 PM
2013-07-11 18:11:51,642 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-46) hostFromVds::Connected host to pool - selectedVds - pluto-vdsb.eng.lab.tlv.redhat.com, spmStatus Free, storage pool DC3
2013-07-11 18:11:51,670 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-46) starting spm on vds pluto-vdsb.eng.lab.tlv.redhat.com, storage pool DC3, prevId -1, LVER -1
2013-07-11 18:11:51,673 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-46) START, SpmStartVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, prevId=-1, prevLVER=-1, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: b39c77e
2013-07-11 18:11:57,218 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RefreshStoragePoolVDSCommand] (pool-6-thread-50) FINISH, RefreshStoragePoolVDSCommand, log id: 48cf6f11
2013-07-11 18:11:57,240 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-46) spmStart polling started: taskId = bddc3574-7238-4d20-9f94-d10ef2d26b0d
2013-07-11 18:11:57,247 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-6-thread-49) START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, ignoreFailoverLimit = false, compatabilityVersion = null), log id: 34db03b1
2013-07-11 18:11:57,248 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-49) clear domain error-timers for pool 83828eac-d311-434d-90e0-ae671a8db853
2013-07-11 18:11:57,250 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-49) clearing cache for problematic entities in pool 83828eac-d311-434d-90e0-ae671a8db853
2013-07-11 18:11:57,251 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-6-thread-49) FINISH, MarkPoolInReconstructModeVDSCommand, log id: 34db03b1
2013-07-11 18:11:57,280 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-6-thread-49) Correlation ID: 77af15aa, Job ID: 8e0e7371-a957-480a-bf03-d44f799b3d42, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center DC3 completed.


Note You need to log in before you can comment on or make changes to this bug.