Bug 975742
Summary: | After power outage, storage reported as up while none of the hosts are up | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Ohad Basan <obasan> |
Component: | ovirt-engine | Assignee: | Liron Aravot <laravot> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Ori Gofen <ogofen> |
Severity: | unspecified | Docs Contact: | |
Priority: | high | ||
Version: | 3.2.0 | CC: | acanan, acathrow, amureini, bazulay, eedri, iheim, jkt, laravot, lpeer, mgoldboi, Rhev-m-bugs, tnisan, yeylon, yzaslavs |
Target Milestone: | --- | Flags: | laravot: needinfo- |
Target Release: | 3.4.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | storage | ||
Fixed In Version: | av5 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | Type: | Bug | |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1090946 |
Description
Ohad Basan
2013-06-19 09:09:49 UTC
This issue seems to be related to SPM selection/host maintenance (the IrsBrokerCommand flow). Apparently there was a pool with one inactive domain, 5ed93bcd-753b-4175-b33e-ce3833ff0ada:

------------------------------------------------------------
[org.ovirt.engine.core.bll.storage.ConnectDomainToStorageCommand] (QuartzScheduler_Worker-73) [7d05f9f9] Running command: ConnectDomainToStorageCommand internal: true. Entities affected : ID: 5ed93bcd-753b-4175-b33e-ce3833ff0ada Type: Storage
2013-06-19 11:10:00,023 INFO [org.ovirt.engine.core.bll.storage.ConnectDomainToStorageCommand] (QuartzScheduler_Worker-73) [7d05f9f9] ConnectDomainToStorage. Before Connect all hosts to pool. Time:6/19/13 11:10 AM
2013-06-19 11:10:00,036 INFO [org.ovirt.engine.core.bll.storage.ConnectDomainToStorageCommand] (QuartzScheduler_Worker-73) [7d05f9f9] ConnectDomainToStorage. After Connect all hosts to pool. Time:6/19/13 11:10 AM
------------------------------------------------------------

Later on a host, buri04, is activated:

------------------------------------------------------------
2013-06-19 11:12:07,889 INFO [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-40) [4f5a18f4] Host buri04.ci.lab.tlv.redhat.com storage connection was succeeded
2013-06-19 11:12:07,901 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-3-thread-49) START, ConnectStoragePoolVDSCommand(HostName = buri04.ci.lab.tlv.redhat.com, HostId = 089f147e-7b5c-11e2-a3e3-00145e8327d8, storagePoolId = 2f72d8be-5826-4f08-a19f-3bc0b8ed28e6, vds_spm_id = 2, masterDomainId = 5ed93bcd-753b-4175-b33e-ce3833ff0ada, masterVersion = 1), log id: 41683e52
------------------------------------------------------------

The host successfully connects to the pool and, since it sees the domain, the domain moves from inactive to active status:

------------------------------------------------------------
Entities affected : ID: 089f147e-7b5c-11e2-a3e3-00145e8327d8 Type: VDS
2013-06-19 11:12:14,240 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-62) Storage Domain 5ed93bcd-753b-4175-b33e-ce3833ff0ada:xtremio-data-1 was reported by Host buri04.ci.lab.tlv.redhat.com as Active in Pool 2f72d8be-5826-4f08-a19f-3bc0b8ed28e6, moving to active status
------------------------------------------------------------

SPM selection later fails because another host, buri03, appears to be the SPM:

------------------------------------------------------------
2013-06-19 11:12:15,879 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-70) hostFromVds::selectedVds - buri04.ci.lab.tlv.redhat.com, spmStatus Free, storage pool Xtremio
2013-06-19 11:12:15,883 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-70) SPM Init: could not find reported vds or not up - pool:Xtremio vds_spm_id: 1
2013-06-19 11:12:15,914 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-70) SPM selection - vds seems as spm buri03.ci.lab.tlv.redhat.com
2013-06-19 11:12:15,914 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-70) spm vds is non responsive, stopping spm selection.
2013-06-19 11:12:25,934 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-68) [3a1abaa6] hostFromVds::selectedVds - buri04.ci.lab.tlv.redhat.com, spmStatus Free, storage pool Xtremio
2013-06-19 11:12:25,936 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-68) [3a1abaa6] SPM Init: could not find reported vds or not up - pool:Xtremio vds_spm_id: 1
2013-06-19 11:12:25,939 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-68) [3a1abaa6] SPM selection - vds seems as spm buri03.ci.lab.tlv.redhat.com
2013-06-19 11:12:25,939 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-68) [3a1abaa6] spm vds is non responsive, stopping spm selection.
------------------------------------------------------------

buri04 is then moved to maintenance:

------------------------------------------------------------
2013-06-19 11:13:21,531 INFO [org.ovirt.engine.core.bll.MaintananceNumberOfVdssCommand] (pool-3-thread-50) [6ba07d19] Running command: MaintananceNumberOfVdssCommand internal: false. Entities affected : ID: 089f147e-7b5c-11e2-a3e3-00145e8327d8 Type: VDS
2013-06-19 11:13:21,534 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-3-thread-50) [6ba07d19] START, SetVdsStatusVDSCommand(HostName = buri04.ci.lab.tlv.redhat.com, HostId = 089f147e-7b5c-11e2-a3e3-00145e8327d8, status=PreparingForMaintenance, nonOperationalReason=NONE), log id: 5c8420ec
2013-06-19 11:13:21,558 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-3-thread-50) [6ba07d19] FINISH, SetVdsStatusVDSCommand, log id: 5c8420ec
2013-06-19 11:13:21,608 INFO [org.ovirt.engine.core.bll.MaintananceVdsCommand] (pool-3-thread-50) [6ba07d19] Running command: MaintananceVdsCommand internal: true. Entities affected : ID: 089f147e-7b5c-11e2-a3e3-00145e8327d8 Type: VDS
2013-06-19 11:13:21,613 WARN [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-3-thread-50) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.bll.MaintananceVdsCommand
2013-06-19 11:13:21,613 WARN [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-3-thread-50) Unable to get value of property: vds for class org.ovirt.engine.core.bll.MaintananceVdsCommand
2013-06-19 11:13:21,935 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-1) Command GetCapabilitiesVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
2013-06-19 11:13:24,045 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-96) Updated vds status from Preparing for Maintenance to Maintenance in database, vds = 089f147e-7b5c-11e2-a3e3-00145e8327d8 : buri04.ci.lab.tlv.redhat.com
2013-06-19 11:13:24,105 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-3-thread-50) Clearing cache of pool: 2f72d8be-5826-4f08-a19f-3bc0b8ed28e6 for problematic entities of VDS: buri04.ci.lab.tlv.redhat.com.
2013-06-19 11:13:24,113 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (QuartzScheduler_Worker-96) START, DisconnectStoragePoolVDSCommand(HostName = buri04.ci.lab.tlv.redhat.com, HostId = 089f147e-7b5c-11e2-a3e3-00145e8327d8, storagePoolId = 2f72d8be-5826-4f08-a19f-3bc0b8ed28e6, vds_spm_id = 2), log id: 14
------------------------------------------------------------

The bug is that, unlike in other scenarios, setStoragePoolStatus, which is executed during the error flows, isn't executed in this scenario.

Bad wording in the last sentence. Bottom line: we should make sure that when all hosts are in maintenance mode there are no active domains, as that can cause issues such as preventing a host from being activated.

Makes some sense to me. For example, IrsBrokerCommand.gethostFromVds has a specific part where, if the previous status is any status other than NonResponsive, the storagePoolStatusChange event will be sent. We need to check the other occurrences in the code and understand these limitations.

It seems the best thing to do is, in the SPM selection area, to check on SPM selection failure whether any other host is UP; if not, change the pool status.

*** Bug 1023145 has been marked as a duplicate of this bug. ***

Liron, what's the state of this? The link to the patch is broken.

Ayal, the patch provided in the tracker is a draft submitted by rgolan when he was handling the bug, and he abandoned it when the bug was moved to me. Returning to assigned and removing the tracker.

So what needs to be done here?

I'll take a further look into this. Basically the issue is that the SPM is non-responsive and we failed to fence it; the domains move to up as other hosts connect to the pool successfully, and when the HSMs are moved to maintenance the domains stay up. We already have a command that should move all the domains to unknown, but it isn't being called for some reason in this case. Calling it would possibly be the simple solution. I'll take a look in other directions as well.

Any update on this issue?

Added a gerrit tracker to resolve the issue.

When there are no hosts that provide a domains report, the domains' status should be 'unknown', as we don't have monitoring to rely on.

Verified on av9.1. Steps taken: add iptables rules from host to engine and from engine to host; wait for the host to become non-responsive; add another host; force the storage domains to be up with psql; move the host to maintenance. The domains moved to the unknown state.

Closing as part of 3.4.0
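For readers following the thread, the rule the comments converge on (if no host is up to report on the pool, the domains should not stay active) can be sketched roughly as below. This is only an illustration under stated assumptions: `PoolStatusSketch`, `HostStatus`, `DomainStatus`, `anyHostUp` and `reconcile` are hypothetical names invented for this sketch, not the ovirt-engine classes or the actual fix that was merged.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are the real ovirt-engine classes.
class PoolStatusSketch {

    enum HostStatus { UP, PREPARING_FOR_MAINTENANCE, MAINTENANCE, NON_RESPONSIVE }

    enum DomainStatus { ACTIVE, INACTIVE, UNKNOWN }

    /** True if at least one host is Up and can therefore report domain status. */
    static boolean anyHostUp(List<HostStatus> hosts) {
        return hosts.contains(HostStatus.UP);
    }

    /**
     * Returns the domain statuses the engine should show. When no host is Up
     * there is no monitoring data to rely on, so every domain becomes UNKNOWN;
     * otherwise the last reported statuses are returned unchanged.
     */
    static Map<String, DomainStatus> reconcile(List<HostStatus> hosts,
                                               Map<String, DomainStatus> reported) {
        // Copy so the caller's map is never modified.
        Map<String, DomainStatus> result = new LinkedHashMap<>(reported);
        if (!anyHostUp(hosts)) {
            result.replaceAll((domainId, status) -> DomainStatus.UNKNOWN);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, DomainStatus> reported = new LinkedHashMap<>();
        reported.put("5ed93bcd-753b-4175-b33e-ce3833ff0ada", DomainStatus.ACTIVE);

        // Scenario from the report: the SPM host is non-responsive and the
        // only other host has been moved to maintenance.
        List<HostStatus> hosts =
                Arrays.asList(HostStatus.NON_RESPONSIVE, HostStatus.MAINTENANCE);
        System.out.println(reconcile(hosts, reported));
    }
}
```

In this sketch the check runs over a plain list of host statuses; in the engine the equivalent decision would presumably hang off the SPM selection failure path or the maintenance flow discussed above.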