Bug 958044

Summary: On storage problem host is moved to nonoperational however one domain is still accessible
Product: Red Hat Enterprise Virtualization Manager Reporter: Jakub Libosvar <jlibosva>
Component: ovirt-engineAssignee: Maor <mlipchuk>
Status: CLOSED WORKSFORME QA Contact: Aharon Canan <acanan>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.2.0CC: acathrow, iheim, jkt, lpeer, mlipchuk, Rhev-m-bugs, scohen, yeylon
Target Milestone: ---Keywords: Triaged
Target Release: 3.3.0   
Hardware: All   
OS: Linux   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-07-11 15:18:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Backend, vdsm logs
none
engine log none

Description Jakub Libosvar 2013-04-30 08:45:41 UTC
Created attachment 741798 [details]
Backend, vdsm logs

Description of problem:
I have one host in setup and three storage domains. Master and one non-master data domain are on storage server A. Third data-domain is on storage server B. I have one running VM that has one disk on master domain.

I blocked network connection from SPM to storage server A. All storage domains went to Unknown status and after a while SPM went to NonOperational, therefore following ReconstructMaster failed. It should have succeeded on the third domain that was still accessible from SPM.

Version-Release number of selected component (if applicable):
rhevm-3.2.0-10.20.master.el6ev.noarch.rpm
vdsm-4.10.2.0.19.gitf9eda0a

How reproducible:
50%

Steps to Reproduce:
1. Have one host, two storage domains from one storage server and one storage domain from different storage server.
2. Block network communication between SPM and master domain that is on first storage server
3. Wait for reconstruct
  
Actual results:
Host is moved to NonOperational however one storage domain is accessible.

Expected results:
Reconstruct happens on third domain that is accessible, inaccessible domains are disconnected and have inactive status. Host is up and DC is up.

Additional info:
Note that there is some vdsm probably built from source, not sure where does it come from, it's on our QA Jenkins.
Connection is blocked at 2013-04-29 18:53:49,430 using iptables
not sure about regression, this is new test.

Comment 1 Maor 2013-07-09 12:55:56 UTC
Reproduce is needed here, this could have already been solved as part of reconstruct changes

Comment 2 Maor 2013-07-11 15:17:37 UTC
Created attachment 772277 [details]
engine log

Comment 3 Maor 2013-07-11 15:18:51 UTC
reproduced the same scenario.
reconstruct has been succeeded.


2013-07-11 18:07:16,734 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-69) START, ConnectStoragePoolVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, vds_spm_id = 1, masterDomainId = a91edf4c-fd49-435d-a7dc-2210393f413e, masterVersion = 1), log id: 146037ab
2013-07-11 18:08:21,211 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-69) Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=304, mMessage=Cannot find master domain: 'spUUID=83828eac-d311-434d-90e0-ae671a8db853, msdUUID=a91edf4c-fd49-435d-a7dc-2210393f413e']]
2013-07-11 18:08:21,212 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-69) HostName = pluto-vdsb.eng.lab.tlv.redhat.com
2013-07-11 18:08:21,213 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-69) Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=83828eac-d311-434d-90e0-ae671a8db853, msdUUID=a91edf4c-fd49-435d-a7dc-2210393f413e'
2013-07-11 18:08:21,215 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-69) FINISH, ConnectStoragePoolVDSCommand, log id: 146037ab
2013-07-11 18:08:21,216 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-69) IrsBroker::Failed::GetStoragePoolInfoVDS
2013-07-11 18:08:21,217 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-69) Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=83828eac-d311-434d-90e0-ae671a8db853, msdUUID=a91edf4c-fd49-435d-a7dc-2210393f413e'
2013-07-11 18:08:21,296 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-6-thread-49) Running command: ReconstructMasterDomainCommand internal: true. Entities affected :  ID: a91edf4c-fd49-435d-a7dc-2210393f413e Type: Storage
2013-07-11 18:08:21,347 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (pool-6-thread-49) START, ResetIrsVDSCommand( storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, ignoreFailoverLimit = false, compatabilityVersion = null, vdsId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, ignoreStopFailed = true), log id: 7820169c
2013-07-11 18:08:21,351 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-6-thread-49) START, SpmStopVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853), log id: 31cd664e
2013-07-11 18:08:21,359 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (pool-6-thread-49) Command org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand return value

TaskStatusListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=654, mMessage=Not SPM: ()]]

2013-07-11 18:08:21,362 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (pool-6-thread-49) HostName = pluto-vdsb.eng.lab.tlv.redhat.com
2013-07-11 18:08:21,363 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (pool-6-thread-49) Command HSMGetAllTasksStatusesVDS execution failed. Exception: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM: ()
2013-07-11 18:08:21,364 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-6-thread-49) SpmStopVDSCommand::Stopping SPM on vds pluto-vdsb.eng.lab.tlv.redhat.com, pool id 83828eac-d311-434d-90e0-ae671a8db853
2013-07-11 18:08:21,375 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-6-thread-49) FINISH, SpmStopVDSCommand, log id: 31cd664e
2013-07-11 18:08:21,389 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (pool-6-thread-49) FINISH, ResetIrsVDSCommand, log id: 7820169c
2013-07-11 18:08:21,392 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (pool-6-thread-49) START, DisconnectStoragePoolVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, vds_spm_id = 1), log id: c8c6e31
2013-07-11 18:08:21,398 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (pool-6-thread-49) FINISH, DisconnectStoragePoolVDSCommand, log id: c8c6e31
2013-07-11 18:08:21,404 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-6-thread-49) START, ReconstructMasterVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, vdsSpmId = 1, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, storagePoolName = DC3, masterDomainId = 8d7eb0de-fa02-4294-a4c8-f685ac4b5a57, masterVersion = 2, domainsList = [{ domainId: 8d7eb0de-fa02-4294-a4c8-f685ac4b5a57, status: Unknown };{ domainId: 2dce38f3-3a49-497c-90f0-7cc950f28f56, status: Unknown };{ domainId: a91edf4c-fd49-435d-a7dc-2210393f413e, status: InActive };]), log id: efad4b4
2013-07-11 18:08:31,265 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-46) hostFromVds::selectedVds - pluto-vdsb.eng.lab.tlv.redhat.com, spmStatus Unknown_Pool, storage pool DC3
2013-07-11 18:08:31,272 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-46) START, ConnectStoragePoolVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, vds_spm_id = 1, masterDomainId = 8d7eb0de-fa02-4294-a4c8-f685ac4b5a57, masterVersion = 2), log id: 1a5e353d
2013-07-11 18:09:47,751 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-6-thread-49) FINISH, ReconstructMasterVDSCommand, log id: efad4b4
2013-07-11 18:09:47,760 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RefreshStoragePoolVDSCommand] (pool-6-thread-50) START, RefreshStoragePoolVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, masterStorageDomainId=8d7eb0de-fa02-4294-a4c8-f685ac4b5a57, masterVersion=2), log id: 48cf6f11
2013-07-11 18:10:00,010 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-8) Autorecovering 1 storage domains
2013-07-11 18:10:00,012 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-8) Autorecovering storage domains id: a91edf4c-fd49-435d-a7dc-2210393f413e
2013-07-11 18:10:00,019 INFO  [org.ovirt.engine.core.bll.storage.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-8) Running command: ConnectDomainToStorageCommand internal: true. Entities affected :  ID: a91edf4c-fd49-435d-a7dc-2210393f413e Type: Storage
2013-07-11 18:10:00,020 INFO  [org.ovirt.engine.core.bll.storage.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-8) ConnectDomainToStorage. Before Connect all hosts to pool. Time:7/11/13 6:10 PM
2013-07-11 18:10:00,035 INFO  [org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand] (pool-6-thread-45) Running command: ConnectStorageToVdsCommand internal: true. Entities affected :  ID: aaa00000-0000-0000-0000-123456789aaa Type: System
2013-07-11 18:10:00,045 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-6-thread-45) START, ConnectStorageServerVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 51688bbd-1b43-45cc-910f-475770d7c563, connection: multipass.eng.lab.tlv.redhat.com:/export/images/rnd/maor/data, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 3a7985d6
2013-07-11 18:10:51,265 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-46) FINISH, ConnectStoragePoolVDSCommand, log id: 1a5e353d
2013-07-11 18:11:51,378 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-6-thread-45) FINISH, ConnectStorageServerVDSCommand, return: {51688bbd-1b43-45cc-910f-475770d7c563=0}, log id: 3a7985d6
2013-07-11 18:11:51,380 INFO  [org.ovirt.engine.core.bll.storage.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-8) ConnectDomainToStorage. After Connect all hosts to pool. Time:7/11/13 6:11 PM
2013-07-11 18:11:51,642 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-46) hostFromVds::Connected host to pool - selectedVds - pluto-vdsb.eng.lab.tlv.redhat.com, spmStatus Free, storage pool DC3
2013-07-11 18:11:51,670 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-46) starting spm on vds pluto-vdsb.eng.lab.tlv.redhat.com, storage pool DC3, prevId -1, LVER -1
2013-07-11 18:11:51,673 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-46) START, SpmStartVDSCommand(HostName = pluto-vdsb.eng.lab.tlv.redhat.com, HostId = 442357d3-ba0e-41a8-98b9-32aa1868b71b, storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, prevId=-1, prevLVER=-1, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: b39c77e
2013-07-11 18:11:57,218 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RefreshStoragePoolVDSCommand] (pool-6-thread-50) FINISH, RefreshStoragePoolVDSCommand, log id: 48cf6f11
2013-07-11 18:11:57,240 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-46) spmStart polling started: taskId = bddc3574-7238-4d20-9f94-d10ef2d26b0d
2013-07-11 18:11:57,247 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-6-thread-49) START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 83828eac-d311-434d-90e0-ae671a8db853, ignoreFailoverLimit = false, compatabilityVersion = null), log id: 34db03b1
2013-07-11 18:11:57,248 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-49) clear domain error-timers for pool 83828eac-d311-434d-90e0-ae671a8db853
2013-07-11 18:11:57,250 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-49) clearing cache for problematic entities in pool 83828eac-d311-434d-90e0-ae671a8db853
2013-07-11 18:11:57,251 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-6-thread-49) FINISH, MarkPoolInReconstructModeVDSCommand, log id: 34db03b1
2013-07-11 18:11:57,280 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-6-thread-49) Correlation ID: 77af15aa, Job ID: 8e0e7371-a957-480a-bf03-d44f799b3d42, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center DC3 completed.