Bug 975742

Summary: After power outage, storage reported as up while none of the hosts are up
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.2.0
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: high
Reporter: Ohad Basan <obasan>
Assignee: Liron Aravot <laravot>
QA Contact: Ori Gofen <ogofen>
CC: acanan, acathrow, amureini, bazulay, eedri, iheim, jkt, laravot, lpeer, mgoldboi, Rhev-m-bugs, tnisan, yeylon, yzaslavs
Flags: laravot: needinfo-
Target Milestone: ---
Target Release: 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
oVirt Team: Storage
Fixed In Version: av5
Doc Type: Bug Fix
Type: Bug
Bug Blocks: 1090946

Description Ohad Basan 2013-06-19 09:09:49 UTC
Description of problem:
A sudden, brutal power outage leads the engine to a state in which it "thinks" there is an active and up storage domain while all of the hosts are in maintenance.
This is an impossible state.

Comment 2 Liron Aravot 2013-06-24 07:58:38 UTC
This issue seems to be related to SPM selection/host maintenance - the IrsBrokerCommand flow.

Apparently there was a pool with one inactive domain, 5ed93bcd-753b-4175-b33e-ce3833ff0ada.

--------------------------------------------------------
[org.ovirt.engine.core.bll.storage.ConnectDomainToStorageCommand] (QuartzScheduler_Worker-73) [7d05f9f9] Running command: ConnectDomainToStorageCommand internal: true. Entities affected :  ID: 5ed93bcd-753b-4175-b33e-ce3833ff0ada Type: Storage
2013-06-19 11:10:00,023 INFO  [org.ovirt.engine.core.bll.storage.ConnectDomainToStorageCommand] (QuartzScheduler_Worker-73) [7d05f9f9] ConnectDomainToStorage. Before Connect all hosts to pool. Time:6/19/13 11:10 AM
2013-06-19 11:10:00,036 INFO  [org.ovirt.engine.core.bll.storage.ConnectDomainToStorageCommand] (QuartzScheduler_Worker-73) [7d05f9f9] ConnectDomainToStorage. After Connect all hosts to pool. Time:6/19/13 11:10 AM
----------------------------------------------------------

Later on, a host is activated: buri04.
---------------------------------------------------------
2013-06-19 11:12:07,889 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-40) [4f5a18f4] Host buri04.ci.lab.tlv.redhat.com storage connection was succeeded
2013-06-19 11:12:07,901 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-3-thread-49) START, ConnectStoragePoolVDSCommand(HostName = buri04.ci.lab.tlv.redhat.com, HostId = 089f147e-7b5c-11e2-a3e3-00145e8327d8, storagePoolId = 2f72d8be-5826-4f08-a19f-3bc0b8ed28e6, vds_spm_id = 2, masterDomainId = 5ed93bcd-753b-4175-b33e-ce3833ff0ada, masterVersion = 1), log id: 41683e52
----------------------------------------------------------------


The host successfully connects to the pool, and since it sees the domain, the domain moves from Inactive to Active status.

----------------------------------------------------------------
Entities affected :  ID: 089f147e-7b5c-11e2-a3e3-00145e8327d8 Type: VDS
2013-06-19 11:12:14,240 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-62) Storage Domain 5ed93bcd-753b-4175-b33e-ce3833ff0ada:xtremio-data-1 was reported by Host buri04.ci.lab.tlv.redhat.com as Active in Pool 2f72d8be-5826-4f08-a19f-3bc0b8ed28e6, moving to active status
----------------------------------------------


SPM selection later fails because another host, buri03, appears to be the SPM.
-----------------------------------------------
2013-06-19 11:12:15,879 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-70) hostFromVds::selectedVds - buri04.ci.lab.tlv.redhat.com, spmStatus Free, storage pool Xtremio
2013-06-19 11:12:15,883 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-70) SPM Init: could not find reported vds or not up - pool:Xtremio vds_spm_id: 1
2013-06-19 11:12:15,914 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-70) SPM selection - vds seems as spm buri03.ci.lab.tlv.redhat.com
2013-06-19 11:12:15,914 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-70) spm vds is non responsive, stopping spm selection.
2013-06-19 11:12:25,934 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-68) [3a1abaa6] hostFromVds::selectedVds - buri04.ci.lab.tlv.redhat.com, spmStatus Free, storage pool Xtremio
2013-06-19 11:12:25,936 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-68) [3a1abaa6] SPM Init: could not find reported vds or not up - pool:Xtremio vds_spm_id: 1
2013-06-19 11:12:25,939 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-68) [3a1abaa6] SPM selection - vds seems as spm buri03.ci.lab.tlv.redhat.com
2013-06-19 11:12:25,939 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-68) [3a1abaa6] spm vds is non responsive, stopping spm selection.
-------------------------------------------------------------

buri04 is then moved to maintenance.
-------------------------------------------------------------
2013-06-19 11:13:21,531 INFO  [org.ovirt.engine.core.bll.MaintananceNumberOfVdssCommand] (pool-3-thread-50) [6ba07d19] Running command: MaintananceNumberOfVdssCommand internal: false. Entities affected :  ID: 089f147e-7b5c-11e2-a3e3-00145e8327d8 Type: VDS
2013-06-19 11:13:21,534 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-3-thread-50) [6ba07d19] START, SetVdsStatusVDSCommand(HostName = buri04.ci.lab.tlv.redhat.com, HostId = 089f147e-7b5c-11e2-a3e3-00145e8327d8, status=PreparingForMaintenance, nonOperationalReason=NONE), log id: 5c8420ec
2013-06-19 11:13:21,558 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-3-thread-50) [6ba07d19] FINISH, SetVdsStatusVDSCommand, log id: 5c8420ec
2013-06-19 11:13:21,608 INFO  [org.ovirt.engine.core.bll.MaintananceVdsCommand] (pool-3-thread-50) [6ba07d19] Running command: MaintananceVdsCommand internal: true. Entities affected :  ID: 089f147e-7b5c-11e2-a3e3-00145e8327d8 Type: VDS
2013-06-19 11:13:21,613 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-3-thread-50) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.bll.MaintananceVdsCommand
2013-06-19 11:13:21,613 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-3-thread-50) Unable to get value of property: vds for class org.ovirt.engine.core.bll.MaintananceVdsCommand
2013-06-19 11:13:21,935 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-1) Command GetCapabilitiesVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
2013-06-19 11:13:24,045 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-96) Updated vds status from Preparing for Maintenance to Maintenance in database,  vds = 089f147e-7b5c-11e2-a3e3-00145e8327d8 : buri04.ci.lab.tlv.redhat.com
2013-06-19 11:13:24,105 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-3-thread-50) Clearing cache of pool: 2f72d8be-5826-4f08-a19f-3bc0b8ed28e6 for problematic entities of VDS: buri04.ci.lab.tlv.redhat.com.
2013-06-19 11:13:24,113 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (QuartzScheduler_Worker-96) START, DisconnectStoragePoolVDSCommand(HostName = buri04.ci.lab.tlv.redhat.com, HostId = 089f147e-7b5c-11e2-a3e3-00145e8327d8, storagePoolId = 2f72d8be-5826-4f08-a19f-3bc0b8ed28e6, vds_spm_id = 2), log id: 14
-------------------------------------------------------------

The bug is that, unlike in other scenarios, setStoragePoolStatus, which is executed during the error flows, isn't executed in this scenario.
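
For illustration, a minimal standalone sketch of the missing step (PoolStatus, handleIrsError and the rest are hypothetical stand-ins, not the actual ovirt-engine classes): when an error flow runs and no host is up, the pool status should be updated rather than left as Up.
-------------------------------------------------------------
import java.util.List;

// Illustrative sketch only -- these types are hypothetical stand-ins,
// not the real ovirt-engine classes.
class IrsErrorFlowSketch {
    enum PoolStatus { UP, NON_RESPONSIVE }

    static PoolStatus poolStatus = PoolStatus.UP;

    // Analogous to calling setStoragePoolStatus in the error flow:
    // with no host up, the pool must not stay reported as Up.
    static void handleIrsError(List<String> upHosts) {
        if (upHosts.isEmpty()) {
            poolStatus = PoolStatus.NON_RESPONSIVE;
        }
    }

    public static void main(String[] args) {
        handleIrsError(List.of());      // after the outage: no hosts up
        System.out.println(poolStatus); // NON_RESPONSIVE
    }
}
-------------------------------------------------------------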

Comment 3 Liron Aravot 2013-06-25 10:43:48 UTC
Bad wording in the last sentence. Bottom line: we should make sure that when all hosts are in maintenance mode there are no active domains, as this can cause issues such as preventing a host from being activated.
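
To make the invariant concrete, a hedged sketch (HostStatus, DomainStatus and isImpossibleState are illustrative names, not engine code) of the state that should never be reachable:
-------------------------------------------------------------
import java.util.List;

// Hypothetical illustration of the invariant from this comment;
// none of these types exist in ovirt-engine under these names.
class PoolInvariantSketch {
    enum HostStatus { UP, MAINTENANCE, NON_RESPONSIVE }
    enum DomainStatus { ACTIVE, INACTIVE, UNKNOWN }

    // The impossible state from the bug description: no host is up,
    // yet at least one storage domain is still reported Active.
    static boolean isImpossibleState(List<HostStatus> hosts, List<DomainStatus> domains) {
        boolean noHostUp = hosts.stream().noneMatch(h -> h == HostStatus.UP);
        boolean anyActive = domains.stream().anyMatch(d -> d == DomainStatus.ACTIVE);
        return noHostUp && anyActive;
    }

    public static void main(String[] args) {
        System.out.println(isImpossibleState(
                List.of(HostStatus.MAINTENANCE, HostStatus.MAINTENANCE),
                List.of(DomainStatus.ACTIVE))); // true -> the bug state
    }
}
-------------------------------------------------------------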

Comment 7 Yair Zaslavsky 2013-08-14 07:55:32 UTC
Makes some sense to me.

For example,
IrsBrokerCommand.gethostFromVds has a specific part where, if the previous status
is any status other than NonResponsive, the storagePoolStatusChange event will be sent.

Need to check other occurrences in the code and understand these limitations.
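
A hedged sketch of the guard described above (the enum and method names are stand-ins; the real check lives in IrsBrokerCommand.gethostFromVds):
-------------------------------------------------------------
// Hypothetical illustration of the condition from comment 7; these
// names are stand-ins, not the real ovirt-engine code.
class StatusChangeGuardSketch {
    enum PoolStatus { UP, PROBLEMATIC, NON_RESPONSIVE }

    // The storagePoolStatusChange event is sent only when the previous
    // status is anything other than NonResponsive -- so a pool already
    // marked NonResponsive never triggers the event again.
    static boolean shouldFireStatusChange(PoolStatus previousStatus) {
        return previousStatus != PoolStatus.NON_RESPONSIVE;
    }

    public static void main(String[] args) {
        System.out.println(shouldFireStatusChange(PoolStatus.UP));             // true
        System.out.println(shouldFireStatusChange(PoolStatus.NON_RESPONSIVE)); // false
    }
}
-------------------------------------------------------------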

Comment 8 Allon Mureinik 2013-08-14 08:05:55 UTC
Moving to infra to investigate based on comment 7.

Comment 9 Roy Golan 2013-10-10 07:40:01 UTC
Seems like the best thing to do is in the SPM selection area:
on SPM selection failure, check if there is any other host UP -
if NOT -> change the pool status. A hedged sketch follows below.
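
(Host, onSpmSelectionFailure and the enums here are illustrative stand-ins, not ovirt-engine types.)
-------------------------------------------------------------
import java.util.List;

// Illustrative-only sketch of comment 9's proposal.
class SpmSelectionFallbackSketch {
    enum HostStatus { UP, MAINTENANCE, NON_RESPONSIVE }
    enum PoolStatus { UP, NON_RESPONSIVE }

    record Host(String name, HostStatus status) {}

    // On SPM selection failure: if no other host is UP, change the pool
    // status instead of leaving the domains reported as Active.
    static PoolStatus onSpmSelectionFailure(List<Host> hosts) {
        boolean anyUp = hosts.stream().anyMatch(h -> h.status() == HostStatus.UP);
        return anyUp ? PoolStatus.UP : PoolStatus.NON_RESPONSIVE;
    }

    public static void main(String[] args) {
        List<Host> hosts = List.of(
                new Host("buri03", HostStatus.NON_RESPONSIVE),
                new Host("buri04", HostStatus.MAINTENANCE));
        System.out.println(onSpmSelectionFailure(hosts)); // NON_RESPONSIVE
    }
}
-------------------------------------------------------------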

Comment 10 Barak 2013-11-03 13:12:18 UTC
*** Bug 1023145 has been marked as a duplicate of this bug. ***

Comment 11 Ayal Baron 2013-12-18 09:32:31 UTC
Liron, what's the state of this? The link to the patch is broken.

Comment 12 Liron Aravot 2013-12-18 09:56:14 UTC
Ayal,
The patch provided in the tracker is a draft submitted by rgolan when he was handling the bug; it was abandoned by him when the bug was moved to me.

Returning to assigned and removing the tracker.

Comment 13 Ayal Baron 2013-12-18 11:10:05 UTC
So what needs to be done here?

Comment 14 Liron Aravot 2013-12-26 11:54:15 UTC
I'll take a further look into this -
basically, the issue is that the SPM is non-responsive and we failed to fence it; the domains move to up as the other hosts connect to the pool successfully, and when the HSMs are moved to maintenance the domains stay up.
We already have a command that should move all the domains to unknown; for some reason it isn't being called in this case - possibly that would be the simple solution. I'll take a look in other directions as well.

Comment 15 Ayal Baron 2014-02-16 10:31:12 UTC
Any update on this issue?

Comment 16 Liron Aravot 2014-03-09 10:58:04 UTC
Added a Gerrit tracker to resolve the issue.
When there are no hosts providing domain reports, the domains' status should be 'Unknown', as we don't have monitoring to rely on.
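
A hedged sketch of that rule (hypothetical names; the actual fix is in the Gerrit tracker, not this code):
-------------------------------------------------------------
import java.util.List;

// Illustrative stand-in for the rule in this comment, not the actual
// patch: with zero reporting hosts there is no monitoring to rely on.
class DomainStatusSketch {
    enum DomainStatus { ACTIVE, INACTIVE, UNKNOWN }

    static DomainStatus resolve(List<DomainStatus> hostReports) {
        if (hostReports.isEmpty()) {
            return DomainStatus.UNKNOWN; // no host reports -> can't trust any status
        }
        // With at least one reporting host, use the report (real
        // aggregation across multiple hosts is more involved).
        return hostReports.get(0);
    }

    public static void main(String[] args) {
        System.out.println(resolve(List.of()));                    // UNKNOWN
        System.out.println(resolve(List.of(DomainStatus.ACTIVE))); // ACTIVE
    }
}
-------------------------------------------------------------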

Comment 19 Ori Gofen 2014-05-19 13:11:57 UTC
Verified on av9.1.
Steps taken:
add iptables rules blocking traffic from the host to the engine and from the engine to the host;
wait for the host to become Non Responsive; add another host;
force the storage domains to be up with psql; move the host to maintenance.
The domains moved to the Unknown state.

Comment 20 Itamar Heim 2014-06-12 14:09:46 UTC
Closing as part of 3.4.0